* [PATCH v4 00/11] VMSCAPE optimization for BHI variant
@ 2025-11-20 6:17 Pawan Gupta
2025-11-20 6:17 ` [PATCH v4 01/11] x86/bhi: x86/vmscape: Move LFENCE out of clear_bhb_loop() Pawan Gupta
` (10 more replies)
0 siblings, 11 replies; 63+ messages in thread
From: Pawan Gupta @ 2025-11-20 6:17 UTC (permalink / raw)
To: x86, David Kaplan, Nikolay Borisov, H. Peter Anvin,
Josh Poimboeuf, Sean Christopherson, Paolo Bonzini,
Borislav Petkov, Dave Hansen
Cc: linux-kernel, kvm, Asit Mallick, Tao Zhang
Sean, David, since this version differs quite a bit from v3, I have not
included the Acked-by tags you provided. Please let me know if it is okay
to add them back.
v4:
- Move LFENCE to the callsite, out of clear_bhb_loop(). (Dave)
- Make clear_bhb_loop() work for larger BHB. (Dave)
This now uses hardware enumeration to determine the BHB size to clear.
- Use write_ibpb() instead of indirect_branch_prediction_barrier() when
IBPB is known to be available. (Dave)
- Use static_call() to simplify mitigation at exit-to-userspace. (Dave)
- Refactor vmscape_select_mitigation(). (Dave)
- Fix vmscape=on which was wrongly behaving as AUTO. (Dave)
- Split the patches. (Dave)
- Patches 1-4 prepare the sequence to be flexible enough for VMSCAPE use.
- Patch 5 is a trivial variable rename.
- Patches 6-8 prepare for deploying the BHB mitigation for VMSCAPE.
- Patch 9 deploys the mitigation.
- Patches 10-11 fix the ON vs AUTO mode handling.
v3: https://lore.kernel.org/r/20251027-vmscape-bhb-v3-0-5793c2534e93@linux.intel.com
- s/x86_pred_flush_pending/x86_predictor_flush_exit_to_user/ (Sean).
- Removed IBPB & BHB-clear mutual exclusion at exit-to-userspace.
- Collected tags.
v2: https://lore.kernel.org/r/20251015-vmscape-bhb-v2-0-91cbdd9c3a96@linux.intel.com
- Added check for IBPB feature in vmscape_select_mitigation(). (David)
- s/vmscape=auto/vmscape=on/ (David)
- Added patch to remove LFENCE from VMSCAPE BHB-clear sequence.
- Rebased to v6.18-rc1.
v1: https://lore.kernel.org/r/20250924-vmscape-bhb-v1-0-da51f0e1934d@linux.intel.com
Hi All,
These patches aim to improve the performance of the recent mitigation for
the VMSCAPE[1] vulnerability. The improvement is relevant for the BHI
variant of VMSCAPE, which affects Alder Lake and newer processors.
The current mitigation approach uses IBPB on KVM-exit-to-userspace for the
entire range of affected CPUs. This is overkill for CPUs that are only
affected by the BHI variant. On such CPUs, clearing the branch history is
sufficient to mitigate VMSCAPE, and is also more apt because the underlying
issue is poisoned branch history.
Below is the iPerf data for transfers between guest and host, comparing the
IBPB and BHB-clear mitigations. BHB-clear shows a performance improvement
over IBPB in most cases.
Platform: Emerald Rapids
Baseline: vmscape=off
(pN = N parallel connections)
| iPerf user-net | IBPB | BHB Clear |
|----------------|---------|-----------|
| UDP 1-vCPU_p1 | -12.5% | 1.3% |
| TCP 1-vCPU_p1 | -10.4% | -1.5% |
| TCP 1-vCPU_p1 | -7.5% | -3.0% |
| UDP 4-vCPU_p16 | -3.7% | -3.7% |
| TCP 4-vCPU_p4 | -2.9% | -1.4% |
| UDP 4-vCPU_p4 | -0.6% | 0.0% |
| TCP 4-vCPU_p4 | 3.5% | 0.0% |
| iPerf bridge-net | IBPB | BHB Clear |
|------------------|---------|-----------|
| UDP 1-vCPU_p1 | -9.4% | -0.4% |
| TCP 1-vCPU_p1 | -3.9% | -0.5% |
| UDP 4-vCPU_p16 | -2.2% | -3.8% |
| TCP 4-vCPU_p4 | -1.0% | -1.0% |
| TCP 4-vCPU_p4 | 0.5% | 0.5% |
| UDP 4-vCPU_p4 | 0.0% | 0.9% |
| TCP 1-vCPU_p1 | 0.0% | 0.9% |
| iPerf vhost-net | IBPB | BHB Clear |
|-----------------|---------|-----------|
| UDP 1-vCPU_p1 | -4.3% | 1.0% |
| TCP 1-vCPU_p1 | -3.8% | -0.5% |
| TCP 1-vCPU_p1 | -2.7% | -0.7% |
| UDP 4-vCPU_p16 | -0.7% | -2.2% |
| TCP 4-vCPU_p4 | -0.4% | 0.8% |
| UDP 4-vCPU_p4 | 0.4% | -0.7% |
| TCP 4-vCPU_p4 | 0.0% | 0.6% |
[1] https://comsec.ethz.ch/research/microarch/vmscape-exposing-and-exploiting-incomplete-branch-predictor-isolation-in-cloud-environments/
---
Pawan Gupta (11):
x86/bhi: x86/vmscape: Move LFENCE out of clear_bhb_loop()
x86/bhi: Move the BHB sequence to a macro for reuse
x86/bhi: Make the depth of BHB-clearing configurable
x86/bhi: Make clear_bhb_loop() effective on newer CPUs
x86/vmscape: Rename x86_ibpb_exit_to_user to x86_predictor_flush_exit_to_user
x86/vmscape: Move mitigation selection to a switch()
x86/vmscape: Use write_ibpb() instead of indirect_branch_prediction_barrier()
x86/vmscape: Use static_call() for predictor flush
x86/vmscape: Deploy BHB clearing mitigation
x86/vmscape: Override conflicting attack-vector controls with =force
x86/vmscape: Add cmdline vmscape=on to override attack vector controls
Documentation/admin-guide/hw-vuln/vmscape.rst | 8 +++
Documentation/admin-guide/kernel-parameters.txt | 4 +-
arch/x86/Kconfig | 1 +
arch/x86/entry/entry_64.S | 49 +++++++++++-------
arch/x86/include/asm/cpufeatures.h | 2 +-
arch/x86/include/asm/entry-common.h | 9 ++--
arch/x86/include/asm/nospec-branch.h | 11 ++--
arch/x86/kernel/cpu/bugs.c | 67 +++++++++++++++++++------
arch/x86/kvm/x86.c | 4 +-
arch/x86/net/bpf_jit_comp.c | 2 +
10 files changed, 114 insertions(+), 43 deletions(-)
---
base-commit: 6a23ae0a96a600d1d12557add110e0bb6e32730c
change-id: 20250916-vmscape-bhb-d7d469977f2f
Best regards,
--
Thanks,
Pawan
* [PATCH v4 01/11] x86/bhi: x86/vmscape: Move LFENCE out of clear_bhb_loop()
2025-11-20 6:17 [PATCH v4 00/11] VMSCAPE optimization for BHI variant Pawan Gupta
@ 2025-11-20 6:17 ` Pawan Gupta
2025-11-20 16:15 ` Nikolay Borisov
2025-11-20 6:18 ` [PATCH v4 02/11] x86/bhi: Move the BHB sequence to a macro for reuse Pawan Gupta
` (9 subsequent siblings)
10 siblings, 1 reply; 63+ messages in thread
From: Pawan Gupta @ 2025-11-20 6:17 UTC (permalink / raw)
To: x86, David Kaplan, Nikolay Borisov, H. Peter Anvin,
Josh Poimboeuf, Sean Christopherson, Paolo Bonzini,
Borislav Petkov, Dave Hansen
Cc: linux-kernel, kvm, Asit Mallick, Tao Zhang
Currently, the BHB clearing sequence is followed by an LFENCE to prevent
premature transient execution of subsequent indirect branches. However, the
LFENCE barrier can be unnecessary in certain cases, for example when the
kernel is using the BHI_DIS_S mitigation and BHB clearing is only needed
for userspace. In such cases, the LFENCE is redundant because the ring
transition provides the necessary serialization.
Below is a quick recap of BHI mitigation options:
On Alder Lake and newer
- BHI_DIS_S: Hardware control to mitigate BHI in ring0. This has low
performance overhead.
- Long loop: Alternatively, a longer version of the BHB clearing
sequence used on older processors can mitigate BHI.
This is not yet implemented in Linux.
On older CPUs
- Short loop: Clears BHB at kernel entry and VMexit.
On Alder Lake and newer CPUs, eIBRS isolates the indirect targets between
guest and host. But when affected by the BHI variant of VMSCAPE, a guest's
branch history may still influence indirect branches in userspace. This
also means that the big-hammer IBPB can be replaced with a cheaper option
that clears the BHB at exit-to-userspace after a VMexit.
In preparation for adding support for the BHB sequence (without LFENCE) on
newer CPUs, move the LFENCE to the caller side, after clear_bhb_loop() is
executed. This allows callers to decide whether they need the LFENCE. It
does add a few extra bytes at the call sites, but it obviates the need for
multiple variants of clear_bhb_loop().
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
---
arch/x86/entry/entry_64.S | 5 ++++-
arch/x86/include/asm/nospec-branch.h | 4 ++--
arch/x86/net/bpf_jit_comp.c | 2 ++
3 files changed, 8 insertions(+), 3 deletions(-)
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index ed04a968cc7d0095ab0185b2e3b5beffb7680afd..886f86790b4467347031bc27d3d761d5cc286da1 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1528,6 +1528,9 @@ SYM_CODE_END(rewind_stack_and_make_dead)
* refactored in the future if needed. The .skips are for safety, to ensure
* that all RETs are in the second half of a cacheline to mitigate Indirect
* Target Selection, rather than taking the slowpath via its_return_thunk.
+ *
+ * Note, callers should use a speculation barrier like LFENCE immediately after
+ * a call to this function to ensure BHB is cleared before indirect branches.
*/
SYM_FUNC_START(clear_bhb_loop)
ANNOTATE_NOENDBR
@@ -1562,7 +1565,7 @@ SYM_FUNC_START(clear_bhb_loop)
sub $1, %ecx
jnz 1b
.Lret2: RET
-5: lfence
+5:
pop %rbp
RET
SYM_FUNC_END(clear_bhb_loop)
diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
index 08ed5a2e46a5fd790bcb1b73feb6469518809c06..ec5ebf96dbb9e240f402f39efc6929ae45ec8f0b 100644
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -329,11 +329,11 @@
#ifdef CONFIG_X86_64
.macro CLEAR_BRANCH_HISTORY
- ALTERNATIVE "", "call clear_bhb_loop", X86_FEATURE_CLEAR_BHB_LOOP
+ ALTERNATIVE "", "call clear_bhb_loop; lfence", X86_FEATURE_CLEAR_BHB_LOOP
.endm
.macro CLEAR_BRANCH_HISTORY_VMEXIT
- ALTERNATIVE "", "call clear_bhb_loop", X86_FEATURE_CLEAR_BHB_VMEXIT
+ ALTERNATIVE "", "call clear_bhb_loop; lfence", X86_FEATURE_CLEAR_BHB_VMEXIT
.endm
#else
#define CLEAR_BRANCH_HISTORY
diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index de5083cb1d3747bba00effca3703a4f6eea80d8d..c1ec14c559119b120edfac079aeb07948e9844b8 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -1603,6 +1603,8 @@ static int emit_spectre_bhb_barrier(u8 **pprog, u8 *ip,
if (emit_call(&prog, func, ip))
return -EINVAL;
+ /* Don't speculate past this until BHB is cleared */
+ EMIT_LFENCE();
EMIT1(0x59); /* pop rcx */
EMIT1(0x58); /* pop rax */
}
--
2.34.1
* [PATCH v4 02/11] x86/bhi: Move the BHB sequence to a macro for reuse
2025-11-20 6:17 [PATCH v4 00/11] VMSCAPE optimization for BHI variant Pawan Gupta
2025-11-20 6:17 ` [PATCH v4 01/11] x86/bhi: x86/vmscape: Move LFENCE out of clear_bhb_loop() Pawan Gupta
@ 2025-11-20 6:18 ` Pawan Gupta
2025-11-20 16:28 ` Nikolay Borisov
2025-11-25 0:21 ` Pawan Gupta
2025-11-20 6:18 ` [PATCH v4 03/11] x86/bhi: Make the depth of BHB-clearing configurable Pawan Gupta
` (8 subsequent siblings)
10 siblings, 2 replies; 63+ messages in thread
From: Pawan Gupta @ 2025-11-20 6:18 UTC (permalink / raw)
To: x86, David Kaplan, Nikolay Borisov, H. Peter Anvin,
Josh Poimboeuf, Sean Christopherson, Paolo Bonzini,
Borislav Petkov, Dave Hansen
Cc: linux-kernel, kvm, Asit Mallick, Tao Zhang
In preparation for making clear_bhb_loop() work on CPUs with a larger BHB,
move the sequence to a macro. This will allow the BHB-clearing depth to be
set easily via macro arguments.
No functional change intended.
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
---
arch/x86/entry/entry_64.S | 37 +++++++++++++++++++++++--------------
1 file changed, 23 insertions(+), 14 deletions(-)
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 886f86790b4467347031bc27d3d761d5cc286da1..a62dbc89c5e75b955ebf6d84f20d157d4bce0253 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1499,11 +1499,6 @@ SYM_CODE_END(rewind_stack_and_make_dead)
* from the branch history tracker in the Branch Predictor, therefore removing
* user influence on subsequent BTB lookups.
*
- * It should be used on parts prior to Alder Lake. Newer parts should use the
- * BHI_DIS_S hardware control instead. If a pre-Alder Lake part is being
- * virtualized on newer hardware the VMM should protect against BHI attacks by
- * setting BHI_DIS_S for the guests.
- *
* CALLs/RETs are necessary to prevent Loop Stream Detector(LSD) from engaging
* and not clearing the branch history. The call tree looks like:
*
@@ -1532,10 +1527,7 @@ SYM_CODE_END(rewind_stack_and_make_dead)
* Note, callers should use a speculation barrier like LFENCE immediately after
* a call to this function to ensure BHB is cleared before indirect branches.
*/
-SYM_FUNC_START(clear_bhb_loop)
- ANNOTATE_NOENDBR
- push %rbp
- mov %rsp, %rbp
+.macro CLEAR_BHB_LOOP_SEQ
movl $5, %ecx
ANNOTATE_INTRA_FUNCTION_CALL
call 1f
@@ -1545,15 +1537,16 @@ SYM_FUNC_START(clear_bhb_loop)
* Shift instructions so that the RET is in the upper half of the
* cacheline and don't take the slowpath to its_return_thunk.
*/
- .skip 32 - (.Lret1 - 1f), 0xcc
+ .skip 32 - (.Lret1_\@ - 1f), 0xcc
ANNOTATE_INTRA_FUNCTION_CALL
1: call 2f
-.Lret1: RET
+.Lret1_\@:
+ RET
.align 64, 0xcc
/*
- * As above shift instructions for RET at .Lret2 as well.
+ * As above shift instructions for RET at .Lret2_\@ as well.
*
- * This should be ideally be: .skip 32 - (.Lret2 - 2f), 0xcc
+ * This should ideally be: .skip 32 - (.Lret2_\@ - 2f), 0xcc
* but some Clang versions (e.g. 18) don't like this.
*/
.skip 32 - 18, 0xcc
@@ -1564,8 +1557,24 @@ SYM_FUNC_START(clear_bhb_loop)
jnz 3b
sub $1, %ecx
jnz 1b
-.Lret2: RET
+.Lret2_\@:
+ RET
5:
+.endm
+
+/*
+ * This should be used on parts prior to Alder Lake. Newer parts should use the
+ * BHI_DIS_S hardware control instead. If a pre-Alder Lake part is being
+ * virtualized on newer hardware the VMM should protect against BHI attacks by
+ * setting BHI_DIS_S for the guests.
+ */
+SYM_FUNC_START(clear_bhb_loop)
+ ANNOTATE_NOENDBR
+ push %rbp
+ mov %rsp, %rbp
+
+ CLEAR_BHB_LOOP_SEQ
+
pop %rbp
RET
SYM_FUNC_END(clear_bhb_loop)
--
2.34.1
* [PATCH v4 03/11] x86/bhi: Make the depth of BHB-clearing configurable
2025-11-20 6:17 [PATCH v4 00/11] VMSCAPE optimization for BHI variant Pawan Gupta
2025-11-20 6:17 ` [PATCH v4 01/11] x86/bhi: x86/vmscape: Move LFENCE out of clear_bhb_loop() Pawan Gupta
2025-11-20 6:18 ` [PATCH v4 02/11] x86/bhi: Move the BHB sequence to a macro for reuse Pawan Gupta
@ 2025-11-20 6:18 ` Pawan Gupta
2025-11-20 17:02 ` Nikolay Borisov
2025-11-20 6:18 ` [PATCH v4 04/11] x86/bhi: Make clear_bhb_loop() effective on newer CPUs Pawan Gupta
` (7 subsequent siblings)
10 siblings, 1 reply; 63+ messages in thread
From: Pawan Gupta @ 2025-11-20 6:18 UTC (permalink / raw)
To: x86, David Kaplan, Nikolay Borisov, H. Peter Anvin,
Josh Poimboeuf, Sean Christopherson, Paolo Bonzini,
Borislav Petkov, Dave Hansen
Cc: linux-kernel, kvm, Asit Mallick, Tao Zhang
The BHB clearing sequence has two nested loops that determine the depth of
the BHB clearing. Introduce arguments to the macro to allow the loop
counts (and hence the depth) to be controlled from outside the macro.
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
---
arch/x86/entry/entry_64.S | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index a62dbc89c5e75b955ebf6d84f20d157d4bce0253..a2891af416c874349c065160708752c41bc6ba36 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1527,8 +1527,8 @@ SYM_CODE_END(rewind_stack_and_make_dead)
* Note, callers should use a speculation barrier like LFENCE immediately after
* a call to this function to ensure BHB is cleared before indirect branches.
*/
-.macro CLEAR_BHB_LOOP_SEQ
- movl $5, %ecx
+.macro CLEAR_BHB_LOOP_SEQ outer_loop_count:req, inner_loop_count:req
+ movl $\outer_loop_count, %ecx
ANNOTATE_INTRA_FUNCTION_CALL
call 1f
jmp 5f
@@ -1550,7 +1550,7 @@ SYM_CODE_END(rewind_stack_and_make_dead)
* but some Clang versions (e.g. 18) don't like this.
*/
.skip 32 - 18, 0xcc
-2: movl $5, %eax
+2: movl $\inner_loop_count, %eax
3: jmp 4f
nop
4: sub $1, %eax
@@ -1573,7 +1573,7 @@ SYM_FUNC_START(clear_bhb_loop)
push %rbp
mov %rsp, %rbp
- CLEAR_BHB_LOOP_SEQ
+ CLEAR_BHB_LOOP_SEQ 5, 5
pop %rbp
RET
--
2.34.1
* [PATCH v4 04/11] x86/bhi: Make clear_bhb_loop() effective on newer CPUs
2025-11-20 6:17 [PATCH v4 00/11] VMSCAPE optimization for BHI variant Pawan Gupta
` (2 preceding siblings ...)
2025-11-20 6:18 ` [PATCH v4 03/11] x86/bhi: Make the depth of BHB-clearing configurable Pawan Gupta
@ 2025-11-20 6:18 ` Pawan Gupta
2025-11-21 12:33 ` Nikolay Borisov
` (2 more replies)
2025-11-20 6:18 ` [PATCH v4 05/11] x86/vmscape: Rename x86_ibpb_exit_to_user to x86_predictor_flush_exit_to_user Pawan Gupta
` (6 subsequent siblings)
10 siblings, 3 replies; 63+ messages in thread
From: Pawan Gupta @ 2025-11-20 6:18 UTC (permalink / raw)
To: x86, David Kaplan, Nikolay Borisov, H. Peter Anvin,
Josh Poimboeuf, Sean Christopherson, Paolo Bonzini,
Borislav Petkov, Dave Hansen
Cc: linux-kernel, kvm, Asit Mallick, Tao Zhang
As a mitigation for BHI, clear_bhb_loop() executes branches that overwrite
the Branch History Buffer (BHB). On Alder Lake and newer parts this
sequence is not sufficient because it doesn't clear enough entries. That
has not been an issue so far because these CPUs have a hardware control
(BHI_DIS_S) that mitigates BHI in the kernel.
The BHI variant of VMSCAPE requires isolating branch history between guests
and userspace, and there is no equivalent hardware control for userspace.
To effectively isolate branch history on newer CPUs, clear_bhb_loop() must
execute a sufficient number of branches to clear the larger BHB.
Dynamically set the loop counts of clear_bhb_loop() such that it is
effective on newer CPUs too. Use the hardware control enumeration
X86_FEATURE_BHI_CTRL to select the appropriate loop counts.
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
---
arch/x86/entry/entry_64.S | 13 ++++++++-----
1 file changed, 8 insertions(+), 5 deletions(-)
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index a2891af416c874349c065160708752c41bc6ba36..6cddd7d6dc927704357d97346fcbb34559608888 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1563,17 +1563,20 @@ SYM_CODE_END(rewind_stack_and_make_dead)
.endm
/*
- * This should be used on parts prior to Alder Lake. Newer parts should use the
- * BHI_DIS_S hardware control instead. If a pre-Alder Lake part is being
- * virtualized on newer hardware the VMM should protect against BHI attacks by
- * setting BHI_DIS_S for the guests.
+ * For protecting the kernel against BHI, this should be used on parts prior to
+ * Alder Lake. Although the sequence works on newer parts, BHI_DIS_S hardware
+ * control has lower overhead, and should be used instead. If a pre-Alder Lake
+ * part is being virtualized on newer hardware the VMM should protect against
+ * BHI attacks by setting BHI_DIS_S for the guests.
*/
SYM_FUNC_START(clear_bhb_loop)
ANNOTATE_NOENDBR
push %rbp
mov %rsp, %rbp
- CLEAR_BHB_LOOP_SEQ 5, 5
+ /* loop count differs based on CPU-gen, see Intel's BHI guidance */
+ ALTERNATIVE __stringify(CLEAR_BHB_LOOP_SEQ 5, 5), \
+ __stringify(CLEAR_BHB_LOOP_SEQ 12, 7), X86_FEATURE_BHI_CTRL
pop %rbp
RET
--
2.34.1
* [PATCH v4 05/11] x86/vmscape: Rename x86_ibpb_exit_to_user to x86_predictor_flush_exit_to_user
2025-11-20 6:17 [PATCH v4 00/11] VMSCAPE optimization for BHI variant Pawan Gupta
` (3 preceding siblings ...)
2025-11-20 6:18 ` [PATCH v4 04/11] x86/bhi: Make clear_bhb_loop() effective on newer CPUs Pawan Gupta
@ 2025-11-20 6:18 ` Pawan Gupta
2025-11-20 6:19 ` [PATCH v4 06/11] x86/vmscape: Move mitigation selection to a switch() Pawan Gupta
` (5 subsequent siblings)
10 siblings, 0 replies; 63+ messages in thread
From: Pawan Gupta @ 2025-11-20 6:18 UTC (permalink / raw)
To: x86, David Kaplan, Nikolay Borisov, H. Peter Anvin,
Josh Poimboeuf, Sean Christopherson, Paolo Bonzini,
Borislav Petkov, Dave Hansen
Cc: linux-kernel, kvm, Asit Mallick, Tao Zhang
With the upcoming changes, x86_ibpb_exit_to_user will also be used when the
BHB clearing sequence is deployed. Rename it to cover both cases.
No functional change.
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
---
arch/x86/include/asm/entry-common.h | 6 +++---
arch/x86/include/asm/nospec-branch.h | 2 +-
arch/x86/kernel/cpu/bugs.c | 4 ++--
arch/x86/kvm/x86.c | 2 +-
4 files changed, 7 insertions(+), 7 deletions(-)
diff --git a/arch/x86/include/asm/entry-common.h b/arch/x86/include/asm/entry-common.h
index ce3eb6d5fdf9f2dba59b7bad24afbfafc8c36918..c45858db16c92fc1364fb818185fba7657840991 100644
--- a/arch/x86/include/asm/entry-common.h
+++ b/arch/x86/include/asm/entry-common.h
@@ -94,11 +94,11 @@ static inline void arch_exit_to_user_mode_prepare(struct pt_regs *regs,
*/
choose_random_kstack_offset(rdtsc());
- /* Avoid unnecessary reads of 'x86_ibpb_exit_to_user' */
+ /* Avoid unnecessary reads of 'x86_predictor_flush_exit_to_user' */
if (cpu_feature_enabled(X86_FEATURE_IBPB_EXIT_TO_USER) &&
- this_cpu_read(x86_ibpb_exit_to_user)) {
+ this_cpu_read(x86_predictor_flush_exit_to_user)) {
indirect_branch_prediction_barrier();
- this_cpu_write(x86_ibpb_exit_to_user, false);
+ this_cpu_write(x86_predictor_flush_exit_to_user, false);
}
}
#define arch_exit_to_user_mode_prepare arch_exit_to_user_mode_prepare
diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
index ec5ebf96dbb9e240f402f39efc6929ae45ec8f0b..df60f9cf51b84e5b75e5db70713188d2e6ad0f5d 100644
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -531,7 +531,7 @@ void alternative_msr_write(unsigned int msr, u64 val, unsigned int feature)
: "memory");
}
-DECLARE_PER_CPU(bool, x86_ibpb_exit_to_user);
+DECLARE_PER_CPU(bool, x86_predictor_flush_exit_to_user);
static inline void indirect_branch_prediction_barrier(void)
{
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index d7fa03bf51b4517c12cc68e7c441f7589a4983d1..1e9b11198db0fe2483bd17b1327bcfd44a2c1dbf 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -113,8 +113,8 @@ EXPORT_PER_CPU_SYMBOL_GPL(x86_spec_ctrl_current);
* be needed to before running userspace. That IBPB will flush the branch
* predictor content.
*/
-DEFINE_PER_CPU(bool, x86_ibpb_exit_to_user);
-EXPORT_PER_CPU_SYMBOL_GPL(x86_ibpb_exit_to_user);
+DEFINE_PER_CPU(bool, x86_predictor_flush_exit_to_user);
+EXPORT_PER_CPU_SYMBOL_GPL(x86_predictor_flush_exit_to_user);
u64 x86_pred_cmd __ro_after_init = PRED_CMD_IBPB;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c9c2aa6f4705e1ae257bf94572967a5724a940a7..60123568fba85c8a445f9220d3f4a1d11fd0eb77 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11397,7 +11397,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
* may migrate to.
*/
if (cpu_feature_enabled(X86_FEATURE_IBPB_EXIT_TO_USER))
- this_cpu_write(x86_ibpb_exit_to_user, true);
+ this_cpu_write(x86_predictor_flush_exit_to_user, true);
/*
* Consume any pending interrupts, including the possible source of
--
2.34.1
* [PATCH v4 06/11] x86/vmscape: Move mitigation selection to a switch()
2025-11-20 6:17 [PATCH v4 00/11] VMSCAPE optimization for BHI variant Pawan Gupta
` (4 preceding siblings ...)
2025-11-20 6:18 ` [PATCH v4 05/11] x86/vmscape: Rename x86_ibpb_exit_to_user to x86_predictor_flush_exit_to_user Pawan Gupta
@ 2025-11-20 6:19 ` Pawan Gupta
2025-11-21 14:27 ` Nikolay Borisov
2025-11-20 6:19 ` [PATCH v4 07/11] x86/vmscape: Use write_ibpb() instead of indirect_branch_prediction_barrier() Pawan Gupta
` (4 subsequent siblings)
10 siblings, 1 reply; 63+ messages in thread
From: Pawan Gupta @ 2025-11-20 6:19 UTC (permalink / raw)
To: x86, David Kaplan, Nikolay Borisov, H. Peter Anvin,
Josh Poimboeuf, Sean Christopherson, Paolo Bonzini,
Borislav Petkov, Dave Hansen
Cc: linux-kernel, kvm, Asit Mallick, Tao Zhang
Convert the mitigation selection in vmscape_select_mitigation() to a
switch() statement. This ensures that all mitigation modes are explicitly
handled, while keeping the selection logic for each mode together. It also
prepares for adding a BHB-clearing mitigation mode for VMSCAPE.
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
---
arch/x86/kernel/cpu/bugs.c | 22 ++++++++++++++++++----
1 file changed, 18 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 1e9b11198db0fe2483bd17b1327bcfd44a2c1dbf..233594ede19bf971c999f4d3cc0f6f213002c16c 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -3231,17 +3231,31 @@ early_param("vmscape", vmscape_parse_cmdline);
static void __init vmscape_select_mitigation(void)
{
- if (!boot_cpu_has_bug(X86_BUG_VMSCAPE) ||
- !boot_cpu_has(X86_FEATURE_IBPB)) {
+ if (!boot_cpu_has_bug(X86_BUG_VMSCAPE)) {
vmscape_mitigation = VMSCAPE_MITIGATION_NONE;
return;
}
- if (vmscape_mitigation == VMSCAPE_MITIGATION_AUTO) {
- if (should_mitigate_vuln(X86_BUG_VMSCAPE))
+ if ((vmscape_mitigation == VMSCAPE_MITIGATION_AUTO) &&
+ !should_mitigate_vuln(X86_BUG_VMSCAPE))
+ vmscape_mitigation = VMSCAPE_MITIGATION_NONE;
+
+ switch (vmscape_mitigation) {
+ case VMSCAPE_MITIGATION_NONE:
+ break;
+
+ case VMSCAPE_MITIGATION_IBPB_ON_VMEXIT:
+ case VMSCAPE_MITIGATION_IBPB_EXIT_TO_USER:
+ if (!boot_cpu_has(X86_FEATURE_IBPB))
+ vmscape_mitigation = VMSCAPE_MITIGATION_NONE;
+ break;
+
+ case VMSCAPE_MITIGATION_AUTO:
+ if (boot_cpu_has(X86_FEATURE_IBPB))
vmscape_mitigation = VMSCAPE_MITIGATION_IBPB_EXIT_TO_USER;
else
vmscape_mitigation = VMSCAPE_MITIGATION_NONE;
+ break;
}
}
--
2.34.1
* [PATCH v4 07/11] x86/vmscape: Use write_ibpb() instead of indirect_branch_prediction_barrier()
2025-11-20 6:17 [PATCH v4 00/11] VMSCAPE optimization for BHI variant Pawan Gupta
` (5 preceding siblings ...)
2025-11-20 6:19 ` [PATCH v4 06/11] x86/vmscape: Move mitigation selection to a switch() Pawan Gupta
@ 2025-11-20 6:19 ` Pawan Gupta
2025-11-21 12:59 ` Nikolay Borisov
2025-11-20 6:19 ` [PATCH v4 08/11] x86/vmscape: Use static_call() for predictor flush Pawan Gupta
` (3 subsequent siblings)
10 siblings, 1 reply; 63+ messages in thread
From: Pawan Gupta @ 2025-11-20 6:19 UTC (permalink / raw)
To: x86, David Kaplan, Nikolay Borisov, H. Peter Anvin,
Josh Poimboeuf, Sean Christopherson, Paolo Bonzini,
Borislav Petkov, Dave Hansen
Cc: linux-kernel, kvm, Asit Mallick, Tao Zhang
indirect_branch_prediction_barrier() is a wrapper around write_ibpb() that
also checks whether the CPU supports IBPB. For VMSCAPE, the call to
indirect_branch_prediction_barrier() is only reached when the CPU supports
IBPB.
Simply call write_ibpb() directly to avoid the unnecessary alternative
patching.
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
---
arch/x86/include/asm/entry-common.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/entry-common.h b/arch/x86/include/asm/entry-common.h
index c45858db16c92fc1364fb818185fba7657840991..78b143673ca72642149eb2dbf3e3e31370fe6b28 100644
--- a/arch/x86/include/asm/entry-common.h
+++ b/arch/x86/include/asm/entry-common.h
@@ -97,7 +97,7 @@ static inline void arch_exit_to_user_mode_prepare(struct pt_regs *regs,
/* Avoid unnecessary reads of 'x86_predictor_flush_exit_to_user' */
if (cpu_feature_enabled(X86_FEATURE_IBPB_EXIT_TO_USER) &&
this_cpu_read(x86_predictor_flush_exit_to_user)) {
- indirect_branch_prediction_barrier();
+ write_ibpb();
this_cpu_write(x86_predictor_flush_exit_to_user, false);
}
}
--
2.34.1
* [PATCH v4 08/11] x86/vmscape: Use static_call() for predictor flush
2025-11-20 6:17 [PATCH v4 00/11] VMSCAPE optimization for BHI variant Pawan Gupta
` (6 preceding siblings ...)
2025-11-20 6:19 ` [PATCH v4 07/11] x86/vmscape: Use write_ibpb() instead of indirect_branch_prediction_barrier() Pawan Gupta
@ 2025-11-20 6:19 ` Pawan Gupta
2025-11-20 6:19 ` [PATCH v4 09/11] x86/vmscape: Deploy BHB clearing mitigation Pawan Gupta
` (2 subsequent siblings)
10 siblings, 0 replies; 63+ messages in thread
From: Pawan Gupta @ 2025-11-20 6:19 UTC (permalink / raw)
To: x86, David Kaplan, Nikolay Borisov, H. Peter Anvin,
Josh Poimboeuf, Sean Christopherson, Paolo Bonzini,
Borislav Petkov, Dave Hansen
Cc: linux-kernel, kvm, Asit Mallick, Tao Zhang
Adding more mitigation options at exit-to-userspace for VMSCAPE would
normally require a series of checks to decide which mitigation to use. In
this case, the mitigation is performed by calling a function that is chosen
at boot. Adding more feature flags and multiple checks can therefore be
avoided by using a static_call() to the mitigating function.
Replace the flag-based mitigation selector with a static_call(). This also
frees the existing X86_FEATURE_IBPB_EXIT_TO_USER flag.
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
---
arch/x86/Kconfig | 1 +
arch/x86/include/asm/cpufeatures.h | 2 +-
arch/x86/include/asm/entry-common.h | 7 +++----
arch/x86/include/asm/nospec-branch.h | 3 +++
arch/x86/kernel/cpu/bugs.c | 5 ++++-
arch/x86/kvm/x86.c | 2 +-
6 files changed, 13 insertions(+), 7 deletions(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index fa3b616af03a2d50eaf5f922bc8cd4e08a284045..066f62f15e67e85fda0f3fd66acabad9a9794ff8 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2706,6 +2706,7 @@ config MITIGATION_TSA
config MITIGATION_VMSCAPE
bool "Mitigate VMSCAPE"
depends on KVM
+ select HAVE_STATIC_CALL
default y
help
Enable mitigation for VMSCAPE attacks. VMSCAPE is a hardware security
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 4091a776e37aaed67ca93b0a0cd23cc25dbc33d4..02871318c999f94ec8557e5fb0b8fb299960d454 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -496,7 +496,7 @@
#define X86_FEATURE_TSA_SQ_NO (21*32+11) /* AMD CPU not vulnerable to TSA-SQ */
#define X86_FEATURE_TSA_L1_NO (21*32+12) /* AMD CPU not vulnerable to TSA-L1 */
#define X86_FEATURE_CLEAR_CPU_BUF_VM (21*32+13) /* Clear CPU buffers using VERW before VMRUN */
-#define X86_FEATURE_IBPB_EXIT_TO_USER (21*32+14) /* Use IBPB on exit-to-userspace, see VMSCAPE bug */
+/* Free */
#define X86_FEATURE_ABMC (21*32+15) /* Assignable Bandwidth Monitoring Counters */
#define X86_FEATURE_MSR_IMM (21*32+16) /* MSR immediate form instructions */
diff --git a/arch/x86/include/asm/entry-common.h b/arch/x86/include/asm/entry-common.h
index 78b143673ca72642149eb2dbf3e3e31370fe6b28..783e7cb50caeb6c6fc68e0a5c75ab43e75e37116 100644
--- a/arch/x86/include/asm/entry-common.h
+++ b/arch/x86/include/asm/entry-common.h
@@ -4,6 +4,7 @@
#include <linux/randomize_kstack.h>
#include <linux/user-return-notifier.h>
+#include <linux/static_call_types.h>
#include <asm/nospec-branch.h>
#include <asm/io_bitmap.h>
@@ -94,10 +95,8 @@ static inline void arch_exit_to_user_mode_prepare(struct pt_regs *regs,
*/
choose_random_kstack_offset(rdtsc());
- /* Avoid unnecessary reads of 'x86_predictor_flush_exit_to_user' */
- if (cpu_feature_enabled(X86_FEATURE_IBPB_EXIT_TO_USER) &&
- this_cpu_read(x86_predictor_flush_exit_to_user)) {
- write_ibpb();
+ if (unlikely(this_cpu_read(x86_predictor_flush_exit_to_user))) {
+ static_call_cond(vmscape_predictor_flush)();
this_cpu_write(x86_predictor_flush_exit_to_user, false);
}
}
diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
index df60f9cf51b84e5b75e5db70713188d2e6ad0f5d..15a2fa8f2f48a066e102263513eff9537ac1d25f 100644
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -540,6 +540,9 @@ static inline void indirect_branch_prediction_barrier(void)
:: "rax", "rcx", "rdx", "memory");
}
+#include <linux/static_call_types.h>
+DECLARE_STATIC_CALL(vmscape_predictor_flush, write_ibpb);
+
/* The Intel SPEC CTRL MSR base value cache */
extern u64 x86_spec_ctrl_base;
DECLARE_PER_CPU(u64, x86_spec_ctrl_current);
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 233594ede19bf971c999f4d3cc0f6f213002c16c..cbb3341b9a19f835738eda7226323d88b7e41e52 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -200,6 +200,9 @@ DEFINE_STATIC_KEY_FALSE(switch_mm_cond_l1d_flush);
DEFINE_STATIC_KEY_FALSE(cpu_buf_vm_clear);
EXPORT_SYMBOL_GPL(cpu_buf_vm_clear);
+DEFINE_STATIC_CALL_NULL(vmscape_predictor_flush, write_ibpb);
+EXPORT_STATIC_CALL_GPL(vmscape_predictor_flush);
+
#undef pr_fmt
#define pr_fmt(fmt) "mitigations: " fmt
@@ -3274,7 +3277,7 @@ static void __init vmscape_update_mitigation(void)
static void __init vmscape_apply_mitigation(void)
{
if (vmscape_mitigation == VMSCAPE_MITIGATION_IBPB_EXIT_TO_USER)
- setup_force_cpu_cap(X86_FEATURE_IBPB_EXIT_TO_USER);
+ static_call_update(vmscape_predictor_flush, write_ibpb);
}
#undef pr_fmt
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 60123568fba85c8a445f9220d3f4a1d11fd0eb77..7e55ef3b3203a26be1a138c8fa838a8c5aae0125 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11396,7 +11396,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
* set for the CPU that actually ran the guest, and not the CPU that it
* may migrate to.
*/
- if (cpu_feature_enabled(X86_FEATURE_IBPB_EXIT_TO_USER))
+ if (static_call_query(vmscape_predictor_flush))
this_cpu_write(x86_predictor_flush_exit_to_user, true);
/*
--
2.34.1
^ permalink raw reply related [flat|nested] 63+ messages in thread
* [PATCH v4 09/11] x86/vmscape: Deploy BHB clearing mitigation
2025-11-20 6:17 [PATCH v4 00/11] VMSCAPE optimization for BHI variant Pawan Gupta
` (7 preceding siblings ...)
2025-11-20 6:19 ` [PATCH v4 08/11] x86/vmscape: Use static_call() for predictor flush Pawan Gupta
@ 2025-11-20 6:19 ` Pawan Gupta
2025-11-21 14:18 ` Nikolay Borisov
2025-11-21 14:23 ` Nikolay Borisov
2025-11-20 6:20 ` [PATCH v4 10/11] x86/vmscape: Override conflicting attack-vector controls with =force Pawan Gupta
2025-11-20 6:20 ` [PATCH v4 11/11] x86/vmscape: Add cmdline vmscape=on to override attack vector controls Pawan Gupta
10 siblings, 2 replies; 63+ messages in thread
From: Pawan Gupta @ 2025-11-20 6:19 UTC (permalink / raw)
To: x86, David Kaplan, Nikolay Borisov, H. Peter Anvin,
Josh Poimboeuf, Sean Christopherson, Paolo Bonzini,
Borislav Petkov, Dave Hansen
Cc: linux-kernel, kvm, Asit Mallick, Tao Zhang
IBPB mitigation for VMSCAPE is overkill on CPUs that are only affected
by the BHI variant of VMSCAPE. On such CPUs, eIBRS already provides
indirect branch isolation between the guest and host userspace. However,
branch history from the guest may still influence indirect branches in
host userspace.
To mitigate the BHI aspect, use clear_bhb_loop().
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
---
Documentation/admin-guide/hw-vuln/vmscape.rst | 4 ++++
arch/x86/include/asm/nospec-branch.h | 2 ++
arch/x86/kernel/cpu/bugs.c | 30 ++++++++++++++++++++-------
3 files changed, 29 insertions(+), 7 deletions(-)
diff --git a/Documentation/admin-guide/hw-vuln/vmscape.rst b/Documentation/admin-guide/hw-vuln/vmscape.rst
index d9b9a2b6c114c05a7325e5f3c9d42129339b870b..dc63a0bac03d43d1e295de0791dd6497d101f986 100644
--- a/Documentation/admin-guide/hw-vuln/vmscape.rst
+++ b/Documentation/admin-guide/hw-vuln/vmscape.rst
@@ -86,6 +86,10 @@ The possible values in this file are:
run a potentially malicious guest and issues an IBPB before the first
exit to userspace after VM-exit.
+ * 'Mitigation: Clear BHB before exit to userspace':
+
+ As above, conditional BHB clearing mitigation is enabled.
+
* 'Mitigation: IBPB on VMEXIT':
IBPB is issued on every VM-exit. This occurs when other mitigations like
diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
index 15a2fa8f2f48a066e102263513eff9537ac1d25f..1e8c26c37dbed4256b35101fb41c0e1eb6ef9272 100644
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -388,6 +388,8 @@ extern void write_ibpb(void);
#ifdef CONFIG_X86_64
extern void clear_bhb_loop(void);
+#else
+static inline void clear_bhb_loop(void) {}
#endif
extern void (*x86_return_thunk)(void);
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index cbb3341b9a19f835738eda7226323d88b7e41e52..d12c07ccf59479ecf590935607394492c988b2ff 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -109,9 +109,8 @@ DEFINE_PER_CPU(u64, x86_spec_ctrl_current);
EXPORT_PER_CPU_SYMBOL_GPL(x86_spec_ctrl_current);
/*
- * Set when the CPU has run a potentially malicious guest. An IBPB will
- * be needed to before running userspace. That IBPB will flush the branch
- * predictor content.
+ * Set when the CPU has run a potentially malicious guest. Indicates that a
+ * branch predictor flush is needed before running userspace.
*/
DEFINE_PER_CPU(bool, x86_predictor_flush_exit_to_user);
EXPORT_PER_CPU_SYMBOL_GPL(x86_predictor_flush_exit_to_user);
@@ -3200,13 +3199,15 @@ enum vmscape_mitigations {
VMSCAPE_MITIGATION_AUTO,
VMSCAPE_MITIGATION_IBPB_EXIT_TO_USER,
VMSCAPE_MITIGATION_IBPB_ON_VMEXIT,
+ VMSCAPE_MITIGATION_BHB_CLEAR_EXIT_TO_USER,
};
static const char * const vmscape_strings[] = {
- [VMSCAPE_MITIGATION_NONE] = "Vulnerable",
+ [VMSCAPE_MITIGATION_NONE] = "Vulnerable",
/* [VMSCAPE_MITIGATION_AUTO] */
- [VMSCAPE_MITIGATION_IBPB_EXIT_TO_USER] = "Mitigation: IBPB before exit to userspace",
- [VMSCAPE_MITIGATION_IBPB_ON_VMEXIT] = "Mitigation: IBPB on VMEXIT",
+ [VMSCAPE_MITIGATION_IBPB_EXIT_TO_USER] = "Mitigation: IBPB before exit to userspace",
+ [VMSCAPE_MITIGATION_IBPB_ON_VMEXIT] = "Mitigation: IBPB on VMEXIT",
+ [VMSCAPE_MITIGATION_BHB_CLEAR_EXIT_TO_USER] = "Mitigation: Clear BHB before exit to userspace",
};
static enum vmscape_mitigations vmscape_mitigation __ro_after_init =
@@ -3253,8 +3254,19 @@ static void __init vmscape_select_mitigation(void)
vmscape_mitigation = VMSCAPE_MITIGATION_NONE;
break;
+ case VMSCAPE_MITIGATION_BHB_CLEAR_EXIT_TO_USER:
+ if (!boot_cpu_has(X86_FEATURE_BHI_CTRL))
+ vmscape_mitigation = VMSCAPE_MITIGATION_NONE;
+ break;
case VMSCAPE_MITIGATION_AUTO:
- if (boot_cpu_has(X86_FEATURE_IBPB))
+ /*
+ * CPUs with BHI_CTRL(ADL and newer) can avoid the IBPB and use BHB
+ * clear sequence. These CPUs are only vulnerable to the BHI variant
+ * of the VMSCAPE attack and does not require an IBPB flush.
+ */
+ if (boot_cpu_has(X86_FEATURE_BHI_CTRL))
+ vmscape_mitigation = VMSCAPE_MITIGATION_BHB_CLEAR_EXIT_TO_USER;
+ else if (boot_cpu_has(X86_FEATURE_IBPB))
vmscape_mitigation = VMSCAPE_MITIGATION_IBPB_EXIT_TO_USER;
else
vmscape_mitigation = VMSCAPE_MITIGATION_NONE;
@@ -3278,6 +3290,9 @@ static void __init vmscape_apply_mitigation(void)
{
if (vmscape_mitigation == VMSCAPE_MITIGATION_IBPB_EXIT_TO_USER)
static_call_update(vmscape_predictor_flush, write_ibpb);
+ else if (vmscape_mitigation == VMSCAPE_MITIGATION_BHB_CLEAR_EXIT_TO_USER &&
+ IS_ENABLED(CONFIG_X86_64))
+ static_call_update(vmscape_predictor_flush, clear_bhb_loop);
}
#undef pr_fmt
@@ -3369,6 +3384,7 @@ void cpu_bugs_smt_update(void)
break;
case VMSCAPE_MITIGATION_IBPB_ON_VMEXIT:
case VMSCAPE_MITIGATION_IBPB_EXIT_TO_USER:
+ case VMSCAPE_MITIGATION_BHB_CLEAR_EXIT_TO_USER:
/*
* Hypervisors can be attacked across-threads, warn for SMT when
* STIBP is not already enabled system-wide.
--
2.34.1
* [PATCH v4 10/11] x86/vmscape: Override conflicting attack-vector controls with =force
2025-11-20 6:17 [PATCH v4 00/11] VMSCAPE optimization for BHI variant Pawan Gupta
` (8 preceding siblings ...)
2025-11-20 6:19 ` [PATCH v4 09/11] x86/vmscape: Deploy BHB clearing mitigation Pawan Gupta
@ 2025-11-20 6:20 ` Pawan Gupta
2025-11-21 18:04 ` Nikolay Borisov
2025-11-20 6:20 ` [PATCH v4 11/11] x86/vmscape: Add cmdline vmscape=on to override attack vector controls Pawan Gupta
10 siblings, 1 reply; 63+ messages in thread
From: Pawan Gupta @ 2025-11-20 6:20 UTC (permalink / raw)
To: x86, David Kaplan, Nikolay Borisov, H. Peter Anvin,
Josh Poimboeuf, Sean Christopherson, Paolo Bonzini,
Borislav Petkov, Dave Hansen
Cc: linux-kernel, kvm, Asit Mallick, Tao Zhang
The vmscape=force option currently defaults to the AUTO mitigation. This is
not correct because attack-vector controls override a mitigation when it is
in AUTO mode. This prevents a user from forcing the VMSCAPE mitigation when
it conflicts with attack-vector controls.
The kernel should deploy a forced mitigation irrespective of attack vectors.
Instead of AUTO, use VMSCAPE_MITIGATION_ON, which wins over attack-vector
controls.
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
---
arch/x86/kernel/cpu/bugs.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index d12c07ccf59479ecf590935607394492c988b2ff..81b0db27f4094c90ebf4704c74f5e7e6b809560f 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -3197,6 +3197,7 @@ static void __init srso_apply_mitigation(void)
enum vmscape_mitigations {
VMSCAPE_MITIGATION_NONE,
VMSCAPE_MITIGATION_AUTO,
+ VMSCAPE_MITIGATION_ON,
VMSCAPE_MITIGATION_IBPB_EXIT_TO_USER,
VMSCAPE_MITIGATION_IBPB_ON_VMEXIT,
VMSCAPE_MITIGATION_BHB_CLEAR_EXIT_TO_USER,
@@ -3205,6 +3206,7 @@ enum vmscape_mitigations {
static const char * const vmscape_strings[] = {
[VMSCAPE_MITIGATION_NONE] = "Vulnerable",
/* [VMSCAPE_MITIGATION_AUTO] */
+ /* [VMSCAPE_MITIGATION_ON] */
[VMSCAPE_MITIGATION_IBPB_EXIT_TO_USER] = "Mitigation: IBPB before exit to userspace",
[VMSCAPE_MITIGATION_IBPB_ON_VMEXIT] = "Mitigation: IBPB on VMEXIT",
[VMSCAPE_MITIGATION_BHB_CLEAR_EXIT_TO_USER] = "Mitigation: Clear BHB before exit to userspace",
@@ -3224,7 +3226,7 @@ static int __init vmscape_parse_cmdline(char *str)
vmscape_mitigation = VMSCAPE_MITIGATION_IBPB_EXIT_TO_USER;
} else if (!strcmp(str, "force")) {
setup_force_cpu_bug(X86_BUG_VMSCAPE);
- vmscape_mitigation = VMSCAPE_MITIGATION_AUTO;
+ vmscape_mitigation = VMSCAPE_MITIGATION_ON;
} else {
pr_err("Ignoring unknown vmscape=%s option.\n", str);
}
@@ -3259,6 +3261,7 @@ static void __init vmscape_select_mitigation(void)
vmscape_mitigation = VMSCAPE_MITIGATION_NONE;
break;
case VMSCAPE_MITIGATION_AUTO:
+ case VMSCAPE_MITIGATION_ON:
/*
* CPUs with BHI_CTRL(ADL and newer) can avoid the IBPB and use BHB
* clear sequence. These CPUs are only vulnerable to the BHI variant
@@ -3381,6 +3384,7 @@ void cpu_bugs_smt_update(void)
switch (vmscape_mitigation) {
case VMSCAPE_MITIGATION_NONE:
case VMSCAPE_MITIGATION_AUTO:
+ case VMSCAPE_MITIGATION_ON:
break;
case VMSCAPE_MITIGATION_IBPB_ON_VMEXIT:
case VMSCAPE_MITIGATION_IBPB_EXIT_TO_USER:
--
2.34.1
* [PATCH v4 11/11] x86/vmscape: Add cmdline vmscape=on to override attack vector controls
2025-11-20 6:17 [PATCH v4 00/11] VMSCAPE optimization for BHI variant Pawan Gupta
` (9 preceding siblings ...)
2025-11-20 6:20 ` [PATCH v4 10/11] x86/vmscape: Override conflicting attack-vector controls with =force Pawan Gupta
@ 2025-11-20 6:20 ` Pawan Gupta
2025-11-25 11:41 ` Nikolay Borisov
10 siblings, 1 reply; 63+ messages in thread
From: Pawan Gupta @ 2025-11-20 6:20 UTC (permalink / raw)
To: x86, David Kaplan, Nikolay Borisov, H. Peter Anvin,
Josh Poimboeuf, Sean Christopherson, Paolo Bonzini,
Borislav Petkov, Dave Hansen
Cc: linux-kernel, kvm, Asit Mallick, Tao Zhang
In general, individual mitigation controls can be used to override the
attack-vector controls. But nothing exists to select the BHB-clearing
mitigation for VMSCAPE. The =force option comes close, but has the
side effect of also forcibly setting the bug, hence deploying the
mitigation on unaffected parts too.
Add a new cmdline option vmscape=on to enable the mitigation based on the
VMSCAPE variant the CPU is affected by.
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
---
Documentation/admin-guide/hw-vuln/vmscape.rst | 4 ++++
Documentation/admin-guide/kernel-parameters.txt | 4 +++-
arch/x86/kernel/cpu/bugs.c | 2 ++
3 files changed, 9 insertions(+), 1 deletion(-)
diff --git a/Documentation/admin-guide/hw-vuln/vmscape.rst b/Documentation/admin-guide/hw-vuln/vmscape.rst
index dc63a0bac03d43d1e295de0791dd6497d101f986..580f288ae8bfc601ff000d6d95d711bb9084459e 100644
--- a/Documentation/admin-guide/hw-vuln/vmscape.rst
+++ b/Documentation/admin-guide/hw-vuln/vmscape.rst
@@ -112,3 +112,7 @@ The mitigation can be controlled via the ``vmscape=`` command line parameter:
Force vulnerability detection and mitigation even on processors that are
not known to be affected.
+
+ * ``vmscape=on``:
+
+ Choose the mitigation based on the VMSCAPE variant the CPU is affected by.
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 6c42061ca20e581b5192b66c6f25aba38d4f8ff8..4b4711ced5e187495476b5365cd7b3df81db893b 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -8104,9 +8104,11 @@
off - disable the mitigation
ibpb - use Indirect Branch Prediction Barrier
- (IBPB) mitigation (default)
+ (IBPB) mitigation
force - force vulnerability detection even on
unaffected processors
+ on - (default) automatically select IBPB
+ or BHB clear mitigation based on CPU
vsyscall= [X86-64,EARLY]
Controls the behavior of vsyscalls (i.e. calls to
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 81b0db27f4094c90ebf4704c74f5e7e6b809560f..b4a21434869fcc01c40a2973f986a3f275f92ef2 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -3227,6 +3227,8 @@ static int __init vmscape_parse_cmdline(char *str)
} else if (!strcmp(str, "force")) {
setup_force_cpu_bug(X86_BUG_VMSCAPE);
vmscape_mitigation = VMSCAPE_MITIGATION_ON;
+ } else if (!strcmp(str, "on")) {
+ vmscape_mitigation = VMSCAPE_MITIGATION_ON;
} else {
pr_err("Ignoring unknown vmscape=%s option.\n", str);
}
--
2.34.1
* Re: [PATCH v4 01/11] x86/bhi: x86/vmscape: Move LFENCE out of clear_bhb_loop()
2025-11-20 6:17 ` [PATCH v4 01/11] x86/bhi: x86/vmscape: Move LFENCE out of clear_bhb_loop() Pawan Gupta
@ 2025-11-20 16:15 ` Nikolay Borisov
2025-11-20 16:56 ` Pawan Gupta
0 siblings, 1 reply; 63+ messages in thread
From: Nikolay Borisov @ 2025-11-20 16:15 UTC (permalink / raw)
To: Pawan Gupta, x86, David Kaplan, H. Peter Anvin, Josh Poimboeuf,
Sean Christopherson, Paolo Bonzini, Borislav Petkov, Dave Hansen
Cc: linux-kernel, kvm, Asit Mallick, Tao Zhang
On 11/20/25 08:17, Pawan Gupta wrote:
> Currently, BHB clearing sequence is followed by an LFENCE to prevent
> transient execution of subsequent indirect branches prematurely. However,
> LFENCE barrier could be unnecessary in certain cases. For example, when
> kernel is using BHI_DIS_S mitigation, and BHB clearing is only needed for
> userspace. In such cases, LFENCE is redundant because ring transitions
> would provide the necessary serialization.
>
> Below is a quick recap of BHI mitigation options:
>
> On Alder Lake and newer
>
> - BHI_DIS_S: Hardware control to mitigate BHI in ring0. This has low
> performance overhead.
> - Long loop: Alternatively, longer version of BHB clearing sequence
> on older processors can be used to mitigate BHI. This
> is not yet implemented in Linux.
I find this description of the Long loop under "Alder Lake and newer"
somewhat confusing, as you are also referring to "older processors".
Shouldn't the longer sequence be moved under the "On older CPUs" heading?
Or perhaps it should be expanded to say that the long sequence could work
on Alder Lake and newer CPUs as well as on older CPUs?
>
> On older CPUs
>
> - Short loop: Clears BHB at kernel entry and VMexit.
<snip>
* Re: [PATCH v4 02/11] x86/bhi: Move the BHB sequence to a macro for reuse
2025-11-20 6:18 ` [PATCH v4 02/11] x86/bhi: Move the BHB sequence to a macro for reuse Pawan Gupta
@ 2025-11-20 16:28 ` Nikolay Borisov
2025-11-20 16:57 ` Pawan Gupta
2025-11-25 0:21 ` Pawan Gupta
1 sibling, 1 reply; 63+ messages in thread
From: Nikolay Borisov @ 2025-11-20 16:28 UTC (permalink / raw)
To: Pawan Gupta, x86, David Kaplan, H. Peter Anvin, Josh Poimboeuf,
Sean Christopherson, Paolo Bonzini, Borislav Petkov, Dave Hansen
Cc: linux-kernel, kvm, Asit Mallick, Tao Zhang
On 11/20/25 08:18, Pawan Gupta wrote:
> In preparation to make clear_bhb_loop() work for CPUs with larger BHB, move
> the sequence to a macro. This will allow setting the depth of BHB-clearing
> easily via arguments.
>
> No functional change intended.
>
> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
* Re: [PATCH v4 01/11] x86/bhi: x86/vmscape: Move LFENCE out of clear_bhb_loop()
2025-11-20 16:15 ` Nikolay Borisov
@ 2025-11-20 16:56 ` Pawan Gupta
2025-11-20 16:58 ` Nikolay Borisov
0 siblings, 1 reply; 63+ messages in thread
From: Pawan Gupta @ 2025-11-20 16:56 UTC (permalink / raw)
To: Nikolay Borisov
Cc: x86, David Kaplan, H. Peter Anvin, Josh Poimboeuf,
Sean Christopherson, Paolo Bonzini, Borislav Petkov, Dave Hansen,
linux-kernel, kvm, Asit Mallick, Tao Zhang
On Thu, Nov 20, 2025 at 06:15:32PM +0200, Nikolay Borisov wrote:
>
>
> On 11/20/25 08:17, Pawan Gupta wrote:
> > Currently, BHB clearing sequence is followed by an LFENCE to prevent
> > transient execution of subsequent indirect branches prematurely. However,
> > LFENCE barrier could be unnecessary in certain cases. For example, when
> > kernel is using BHI_DIS_S mitigation, and BHB clearing is only needed for
> > userspace. In such cases, LFENCE is redundant because ring transitions
> > would provide the necessary serialization.
> >
> > Below is a quick recap of BHI mitigation options:
> >
> > On Alder Lake and newer
> >
> > - BHI_DIS_S: Hardware control to mitigate BHI in ring0. This has low
> > performance overhead.
> > - Long loop: Alternatively, longer version of BHB clearing sequence
> > on older processors can be used to mitigate BHI. This
> > is not yet implemented in Linux.
>
> I find this description of the Long loop under "Alder Lake and newer"
> somewhat confusing, as you are also referring to "older processors".
> Shouldn't the longer sequence be moved under the "On older CPUs" heading?
> Or perhaps it should be expanded to say that the long sequence could work
> on Alder Lake and newer CPUs as well as on older CPUs?
Ya, it needs to be rephrased. Would dropping "on older processors" help?
- Long loop: Alternatively, longer version of BHB clearing sequence
can be used to mitigate BHI. This is not yet implemented
in Linux.
> >
> > On older CPUs
> >
> > - Short loop: Clears BHB at kernel entry and VMexit.
And also talk about "Long loop" effectiveness here:
On older CPUs
- Short loop: Clears BHB at kernel entry and VMexit. The "Long loop"
is effective on older CPUs as well, but should be avoided
because of unnecessary overhead.
> <snip>
* Re: [PATCH v4 02/11] x86/bhi: Move the BHB sequence to a macro for reuse
2025-11-20 16:28 ` Nikolay Borisov
@ 2025-11-20 16:57 ` Pawan Gupta
0 siblings, 0 replies; 63+ messages in thread
From: Pawan Gupta @ 2025-11-20 16:57 UTC (permalink / raw)
To: Nikolay Borisov
Cc: x86, David Kaplan, H. Peter Anvin, Josh Poimboeuf,
Sean Christopherson, Paolo Bonzini, Borislav Petkov, Dave Hansen,
linux-kernel, kvm, Asit Mallick, Tao Zhang
On Thu, Nov 20, 2025 at 06:28:51PM +0200, Nikolay Borisov wrote:
>
>
> On 11/20/25 08:18, Pawan Gupta wrote:
> > In preparation to make clear_bhb_loop() work for CPUs with larger BHB, move
> > the sequence to a macro. This will allow setting the depth of BHB-clearing
> > easily via arguments.
> >
> > No functional change intended.
> >
> > Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
>
> Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Thanks.
* Re: [PATCH v4 01/11] x86/bhi: x86/vmscape: Move LFENCE out of clear_bhb_loop()
2025-11-20 16:56 ` Pawan Gupta
@ 2025-11-20 16:58 ` Nikolay Borisov
0 siblings, 0 replies; 63+ messages in thread
From: Nikolay Borisov @ 2025-11-20 16:58 UTC (permalink / raw)
To: Pawan Gupta
Cc: x86, David Kaplan, H. Peter Anvin, Josh Poimboeuf,
Sean Christopherson, Paolo Bonzini, Borislav Petkov, Dave Hansen,
linux-kernel, kvm, Asit Mallick, Tao Zhang
On 11/20/25 18:56, Pawan Gupta wrote:
> On Thu, Nov 20, 2025 at 06:15:32PM +0200, Nikolay Borisov wrote:
>>
>>
>> On 11/20/25 08:17, Pawan Gupta wrote:
>>> Currently, BHB clearing sequence is followed by an LFENCE to prevent
>>> transient execution of subsequent indirect branches prematurely. However,
>>> LFENCE barrier could be unnecessary in certain cases. For example, when
>>> kernel is using BHI_DIS_S mitigation, and BHB clearing is only needed for
>>> userspace. In such cases, LFENCE is redundant because ring transitions
>>> would provide the necessary serialization.
>>>
>>> Below is a quick recap of BHI mitigation options:
>>>
>>> On Alder Lake and newer
>>>
>>> - BHI_DIS_S: Hardware control to mitigate BHI in ring0. This has low
>>> performance overhead.
>>> - Long loop: Alternatively, longer version of BHB clearing sequence
>>> on older processors can be used to mitigate BHI. This
>>> is not yet implemented in Linux.
>>
>> I find this description of the Long loop under "Alder Lake and newer"
>> somewhat confusing, as you are also referring to "older processors".
>> Shouldn't the longer sequence be moved under the "On older CPUs" heading?
>> Or perhaps it should be expanded to say that the long sequence could work
>> on Alder Lake and newer CPUs as well as on older CPUs?
>
> Ya, it needs to be rephrased. Would dropping "on older processors" help?
>
> - Long loop: Alternatively, longer version of BHB clearing sequence
> can be used to mitigate BHI. This is not yet implemented
> in Linux.
>
nit: Perhaps add a sentence about why the long-loop version might be used
on newer parts in certain cases, or why it shouldn't.
>>>
>>> On older CPUs
>>>
>>> - Short loop: Clears BHB at kernel entry and VMexit.
>
> And also talk about "Long loop" effectiveness here:
>
> On older CPUs
>
> - Short loop: Clears BHB at kernel entry and VMexit. The "Long loop"
> is effective on older CPUs as well, but should be avoided
> because of unnecessary overhead.
>> <snip>
In any case it's much better and indeed clear!
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
* Re: [PATCH v4 03/11] x86/bhi: Make the depth of BHB-clearing configurable
2025-11-20 6:18 ` [PATCH v4 03/11] x86/bhi: Make the depth of BHB-clearing configurable Pawan Gupta
@ 2025-11-20 17:02 ` Nikolay Borisov
0 siblings, 0 replies; 63+ messages in thread
From: Nikolay Borisov @ 2025-11-20 17:02 UTC (permalink / raw)
To: Pawan Gupta, x86, David Kaplan, H. Peter Anvin, Josh Poimboeuf,
Sean Christopherson, Paolo Bonzini, Borislav Petkov, Dave Hansen
Cc: linux-kernel, kvm, Asit Mallick, Tao Zhang
On 11/20/25 08:18, Pawan Gupta wrote:
> The BHB clearing sequence has two nested loops that determine the depth of
> BHB to be cleared. Introduce an argument to the macro to allow the loop
> count (and hence the depth) to be controlled from outside the macro.
>
> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
* Re: [PATCH v4 04/11] x86/bhi: Make clear_bhb_loop() effective on newer CPUs
2025-11-20 6:18 ` [PATCH v4 04/11] x86/bhi: Make clear_bhb_loop() effective on newer CPUs Pawan Gupta
@ 2025-11-21 12:33 ` Nikolay Borisov
2025-11-21 16:40 ` Dave Hansen
2026-03-06 21:00 ` Jim Mattson
2 siblings, 0 replies; 63+ messages in thread
From: Nikolay Borisov @ 2025-11-21 12:33 UTC (permalink / raw)
To: Pawan Gupta, x86, David Kaplan, H. Peter Anvin, Josh Poimboeuf,
Sean Christopherson, Paolo Bonzini, Borislav Petkov, Dave Hansen
Cc: linux-kernel, kvm, Asit Mallick, Tao Zhang
On 11/20/25 08:18, Pawan Gupta wrote:
> As a mitigation for BHI, clear_bhb_loop() executes branches that overwrites
> the Branch History Buffer (BHB). On Alder Lake and newer parts this
> sequence is not sufficient because it doesn't clear enough entries. This
> was not an issue because these CPUs have a hardware control (BHI_DIS_S)
> that mitigates BHI in kernel.
>
> BHI variant of VMSCAPE requires isolating branch history between guests and
> userspace. Note that there is no equivalent hardware control for userspace.
> To effectively isolate branch history on newer CPUs, clear_bhb_loop()
> should execute sufficient number of branches to clear a larger BHB.
>
> Dynamically set the loop count of clear_bhb_loop() such that it is
> effective on newer CPUs too. Use the hardware control enumeration
> X86_FEATURE_BHI_CTRL to select the appropriate loop count.
>
> Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
* Re: [PATCH v4 07/11] x86/vmscape: Use write_ibpb() instead of indirect_branch_prediction_barrier()
2025-11-20 6:19 ` [PATCH v4 07/11] x86/vmscape: Use write_ibpb() instead of indirect_branch_prediction_barrier() Pawan Gupta
@ 2025-11-21 12:59 ` Nikolay Borisov
0 siblings, 0 replies; 63+ messages in thread
From: Nikolay Borisov @ 2025-11-21 12:59 UTC (permalink / raw)
To: Pawan Gupta, x86, David Kaplan, H. Peter Anvin, Josh Poimboeuf,
Sean Christopherson, Paolo Bonzini, Borislav Petkov, Dave Hansen
Cc: linux-kernel, kvm, Asit Mallick, Tao Zhang
On 11/20/25 08:19, Pawan Gupta wrote:
> indirect_branch_prediction_barrier() is a wrapper to write_ibpb(), which
> also checks if the CPU supports IBPB. For VMSCAPE, call to
> indirect_branch_prediction_barrier() is only possible when CPU supports
> IBPB.
>
> Simply call write_ibpb() directly to avoid unnecessary alternative
> patching.
>
> Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
* Re: [PATCH v4 09/11] x86/vmscape: Deploy BHB clearing mitigation
2025-11-20 6:19 ` [PATCH v4 09/11] x86/vmscape: Deploy BHB clearing mitigation Pawan Gupta
@ 2025-11-21 14:18 ` Nikolay Borisov
2025-11-21 18:29 ` Pawan Gupta
2025-11-21 14:23 ` Nikolay Borisov
1 sibling, 1 reply; 63+ messages in thread
From: Nikolay Borisov @ 2025-11-21 14:18 UTC (permalink / raw)
To: Pawan Gupta, x86, David Kaplan, H. Peter Anvin, Josh Poimboeuf,
Sean Christopherson, Paolo Bonzini, Borislav Petkov, Dave Hansen
Cc: linux-kernel, kvm, Asit Mallick, Tao Zhang
On 11/20/25 08:19, Pawan Gupta wrote:
> IBPB mitigation for VMSCAPE is overkill on CPUs that are only affected
> by the BHI variant of VMSCAPE. On such CPUs, eIBRS already provides
> indirect branch isolation between the guest and host userspace. However,
> branch history from the guest may still influence indirect branches in
> host userspace.
>
> To mitigate the BHI aspect, use clear_bhb_loop().
>
> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
<snip>
> @@ -3278,6 +3290,9 @@ static void __init vmscape_apply_mitigation(void)
> {
> if (vmscape_mitigation == VMSCAPE_MITIGATION_IBPB_EXIT_TO_USER)
> static_call_update(vmscape_predictor_flush, write_ibpb);
> + else if (vmscape_mitigation == VMSCAPE_MITIGATION_BHB_CLEAR_EXIT_TO_USER &&
> + IS_ENABLED(CONFIG_X86_64))
why the x86_64 dependency ?
> + static_call_update(vmscape_predictor_flush, clear_bhb_loop);
> }
>
> #undef pr_fmt
> @@ -3369,6 +3384,7 @@ void cpu_bugs_smt_update(void)
> break;
> case VMSCAPE_MITIGATION_IBPB_ON_VMEXIT:
> case VMSCAPE_MITIGATION_IBPB_EXIT_TO_USER:
> + case VMSCAPE_MITIGATION_BHB_CLEAR_EXIT_TO_USER:
> /*
> * Hypervisors can be attacked across-threads, warn for SMT when
> * STIBP is not already enabled system-wide.
>
* Re: [PATCH v4 09/11] x86/vmscape: Deploy BHB clearing mitigation
2025-11-20 6:19 ` [PATCH v4 09/11] x86/vmscape: Deploy BHB clearing mitigation Pawan Gupta
2025-11-21 14:18 ` Nikolay Borisov
@ 2025-11-21 14:23 ` Nikolay Borisov
2025-11-21 18:41 ` Pawan Gupta
1 sibling, 1 reply; 63+ messages in thread
From: Nikolay Borisov @ 2025-11-21 14:23 UTC (permalink / raw)
To: Pawan Gupta, x86, David Kaplan, H. Peter Anvin, Josh Poimboeuf,
Sean Christopherson, Paolo Bonzini, Borislav Petkov, Dave Hansen
Cc: linux-kernel, kvm, Asit Mallick, Tao Zhang
On 11/20/25 08:19, Pawan Gupta wrote:
> IBPB mitigation for VMSCAPE is overkill on CPUs that are only affected
> by the BHI variant of VMSCAPE. On such CPUs, eIBRS already provides
> indirect branch isolation between the guest and host userspace. However,
> branch history from the guest may still influence indirect branches in
> host userspace.
>
> To mitigate the BHI aspect, use clear_bhb_loop().
>
> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
> ---
> Documentation/admin-guide/hw-vuln/vmscape.rst | 4 ++++
> arch/x86/include/asm/nospec-branch.h | 2 ++
> arch/x86/kernel/cpu/bugs.c | 30 ++++++++++++++++++++-------
> 3 files changed, 29 insertions(+), 7 deletions(-)
>
> diff --git a/Documentation/admin-guide/hw-vuln/vmscape.rst b/Documentation/admin-guide/hw-vuln/vmscape.rst
> index d9b9a2b6c114c05a7325e5f3c9d42129339b870b..dc63a0bac03d43d1e295de0791dd6497d101f986 100644
> --- a/Documentation/admin-guide/hw-vuln/vmscape.rst
> +++ b/Documentation/admin-guide/hw-vuln/vmscape.rst
> @@ -86,6 +86,10 @@ The possible values in this file are:
> run a potentially malicious guest and issues an IBPB before the first
> exit to userspace after VM-exit.
>
> + * 'Mitigation: Clear BHB before exit to userspace':
> +
> + As above, conditional BHB clearing mitigation is enabled.
> +
> * 'Mitigation: IBPB on VMEXIT':
>
> IBPB is issued on every VM-exit. This occurs when other mitigations like
> diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
> index 15a2fa8f2f48a066e102263513eff9537ac1d25f..1e8c26c37dbed4256b35101fb41c0e1eb6ef9272 100644
> --- a/arch/x86/include/asm/nospec-branch.h
> +++ b/arch/x86/include/asm/nospec-branch.h
> @@ -388,6 +388,8 @@ extern void write_ibpb(void);
>
> #ifdef CONFIG_X86_64
> extern void clear_bhb_loop(void);
> +#else
> +static inline void clear_bhb_loop(void) {}
> #endif
>
> extern void (*x86_return_thunk)(void);
> diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
> index cbb3341b9a19f835738eda7226323d88b7e41e52..d12c07ccf59479ecf590935607394492c988b2ff 100644
> --- a/arch/x86/kernel/cpu/bugs.c
> +++ b/arch/x86/kernel/cpu/bugs.c
> @@ -109,9 +109,8 @@ DEFINE_PER_CPU(u64, x86_spec_ctrl_current);
> EXPORT_PER_CPU_SYMBOL_GPL(x86_spec_ctrl_current);
>
> /*
> - * Set when the CPU has run a potentially malicious guest. An IBPB will
> - * be needed to before running userspace. That IBPB will flush the branch
> - * predictor content.
> + * Set when the CPU has run a potentially malicious guest. Indicates that a
> + * branch predictor flush is needed before running userspace.
> */
> DEFINE_PER_CPU(bool, x86_predictor_flush_exit_to_user);
> EXPORT_PER_CPU_SYMBOL_GPL(x86_predictor_flush_exit_to_user);
> @@ -3200,13 +3199,15 @@ enum vmscape_mitigations {
> VMSCAPE_MITIGATION_AUTO,
> VMSCAPE_MITIGATION_IBPB_EXIT_TO_USER,
> VMSCAPE_MITIGATION_IBPB_ON_VMEXIT,
> + VMSCAPE_MITIGATION_BHB_CLEAR_EXIT_TO_USER,
> };
>
> static const char * const vmscape_strings[] = {
> - [VMSCAPE_MITIGATION_NONE] = "Vulnerable",
> + [VMSCAPE_MITIGATION_NONE] = "Vulnerable",
> /* [VMSCAPE_MITIGATION_AUTO] */
> - [VMSCAPE_MITIGATION_IBPB_EXIT_TO_USER] = "Mitigation: IBPB before exit to userspace",
> - [VMSCAPE_MITIGATION_IBPB_ON_VMEXIT] = "Mitigation: IBPB on VMEXIT",
> + [VMSCAPE_MITIGATION_IBPB_EXIT_TO_USER] = "Mitigation: IBPB before exit to userspace",
> + [VMSCAPE_MITIGATION_IBPB_ON_VMEXIT] = "Mitigation: IBPB on VMEXIT",
> + [VMSCAPE_MITIGATION_BHB_CLEAR_EXIT_TO_USER] = "Mitigation: Clear BHB before exit to userspace",
> };
>
> static enum vmscape_mitigations vmscape_mitigation __ro_after_init =
> @@ -3253,8 +3254,19 @@ static void __init vmscape_select_mitigation(void)
> vmscape_mitigation = VMSCAPE_MITIGATION_NONE;
> break;
>
> + case VMSCAPE_MITIGATION_BHB_CLEAR_EXIT_TO_USER:
> + if (!boot_cpu_has(X86_FEATURE_BHI_CTRL))
> + vmscape_mitigation = VMSCAPE_MITIGATION_NONE;
> + break;
Am I missing something, or can this case never execute, because
VMSCAPE_MITIGATION_BHB_CLEAR_EXIT_TO_USER is only ever set when the
mitigation is VMSCAPE_MITIGATION_AUTO in the branch below? Perhaps just
remove it? This just shows how confusing the logic for choosing the
mitigations has become....
> case VMSCAPE_MITIGATION_AUTO:
> - if (boot_cpu_has(X86_FEATURE_IBPB))
> + /*
> + * CPUs with BHI_CTRL (ADL and newer) can avoid the IBPB and use the
> + * BHB clear sequence. These CPUs are only vulnerable to the BHI
> + * variant of the VMSCAPE attack and do not require an IBPB flush.
> + */
> + if (boot_cpu_has(X86_FEATURE_BHI_CTRL))
> + vmscape_mitigation = VMSCAPE_MITIGATION_BHB_CLEAR_EXIT_TO_USER;
> + else if (boot_cpu_has(X86_FEATURE_IBPB))
> vmscape_mitigation = VMSCAPE_MITIGATION_IBPB_EXIT_TO_USER;
> else
> vmscape_mitigation = VMSCAPE_MITIGATION_NONE;
<snip>
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH v4 06/11] x86/vmscape: Move mitigation selection to a switch()
2025-11-20 6:19 ` [PATCH v4 06/11] x86/vmscape: Move mitigation selection to a switch() Pawan Gupta
@ 2025-11-21 14:27 ` Nikolay Borisov
2025-11-24 23:09 ` Pawan Gupta
0 siblings, 1 reply; 63+ messages in thread
From: Nikolay Borisov @ 2025-11-21 14:27 UTC (permalink / raw)
To: Pawan Gupta, x86, David Kaplan, H. Peter Anvin, Josh Poimboeuf,
Sean Christopherson, Paolo Bonzini, Borislav Petkov, Dave Hansen
Cc: linux-kernel, kvm, Asit Mallick, Tao Zhang
On 11/20/25 08:19, Pawan Gupta wrote:
> This ensures that all mitigation modes are explicitly handled, while
> keeping the mitigation selection for each mode together. This also prepares
> for adding BHB-clearing mitigation mode for VMSCAPE.
>
> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
> ---
> arch/x86/kernel/cpu/bugs.c | 22 ++++++++++++++++++----
> 1 file changed, 18 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
> index 1e9b11198db0fe2483bd17b1327bcfd44a2c1dbf..233594ede19bf971c999f4d3cc0f6f213002c16c 100644
> --- a/arch/x86/kernel/cpu/bugs.c
> +++ b/arch/x86/kernel/cpu/bugs.c
> @@ -3231,17 +3231,31 @@ early_param("vmscape", vmscape_parse_cmdline);
>
> static void __init vmscape_select_mitigation(void)
> {
> - if (!boot_cpu_has_bug(X86_BUG_VMSCAPE) ||
> - !boot_cpu_has(X86_FEATURE_IBPB)) {
> + if (!boot_cpu_has_bug(X86_BUG_VMSCAPE)) {
> vmscape_mitigation = VMSCAPE_MITIGATION_NONE;
> return;
> }
>
> - if (vmscape_mitigation == VMSCAPE_MITIGATION_AUTO) {
> - if (should_mitigate_vuln(X86_BUG_VMSCAPE))
> + if ((vmscape_mitigation == VMSCAPE_MITIGATION_AUTO) &&
> + !should_mitigate_vuln(X86_BUG_VMSCAPE))
> + vmscape_mitigation = VMSCAPE_MITIGATION_NONE;
> +
> + switch (vmscape_mitigation) {
> + case VMSCAPE_MITIGATION_NONE:
> + break;
> +
> + case VMSCAPE_MITIGATION_IBPB_ON_VMEXIT:
> + case VMSCAPE_MITIGATION_IBPB_EXIT_TO_USER:
> + if (!boot_cpu_has(X86_FEATURE_IBPB))
> + vmscape_mitigation = VMSCAPE_MITIGATION_NONE;
> + break;
> +
> + case VMSCAPE_MITIGATION_AUTO:
> + if (boot_cpu_has(X86_FEATURE_IBPB))
> vmscape_mitigation = VMSCAPE_MITIGATION_IBPB_EXIT_TO_USER;
IMO this patch is a net-negative because, as per my reply to patch 9, you
effectively have a dead branch: the BHB_CLEAR_EXIT_TO_USER one. However, it
turns out you have yet another one: VMSCAPE_MITIGATION_IBPB_ON_VMEXIT, as
it's only ever set in vmscape_update_mitigation(), which also executes
after '_select()', and additionally you duplicate the FEATURE_IBPB check.
So I think either dropping it or removing the superfluous branches is in
order.
> else
> vmscape_mitigation = VMSCAPE_MITIGATION_NONE;
> + break;
> }
> }
>
>
* Re: [PATCH v4 04/11] x86/bhi: Make clear_bhb_loop() effective on newer CPUs
2025-11-20 6:18 ` [PATCH v4 04/11] x86/bhi: Make clear_bhb_loop() effective on newer CPUs Pawan Gupta
2025-11-21 12:33 ` Nikolay Borisov
@ 2025-11-21 16:40 ` Dave Hansen
2025-11-21 16:45 ` Nikolay Borisov
2025-11-26 19:23 ` Pawan Gupta
2026-03-06 21:00 ` Jim Mattson
2 siblings, 2 replies; 63+ messages in thread
From: Dave Hansen @ 2025-11-21 16:40 UTC (permalink / raw)
To: Pawan Gupta, x86, David Kaplan, Nikolay Borisov, H. Peter Anvin,
Josh Poimboeuf, Sean Christopherson, Paolo Bonzini,
Borislav Petkov, Dave Hansen
Cc: linux-kernel, kvm, Asit Mallick, Tao Zhang, Peter Zijlstra
On 11/19/25 22:18, Pawan Gupta wrote:
> - CLEAR_BHB_LOOP_SEQ 5, 5
> + /* loop count differs based on CPU-gen, see Intel's BHI guidance */
> + ALTERNATIVE (CLEAR_BHB_LOOP_SEQ 5, 5), \
> + __stringify(CLEAR_BHB_LOOP_SEQ 12, 7), X86_FEATURE_BHI_CTRL
There are a million ways to skin this cat. But I'm not sure I really
like the end result here. It seems a little overkill to use ALTERNATIVE
to rewrite a whole sequence just to patch two constants in there.
What if the CLEAR_BHB_LOOP_SEQ just took its inner and outer loop counts
as register arguments? Then this would look more like:
ALTERNATIVE "mov $5, %rdi; mov $5, %rsi",
"mov $12, %rdi; mov $7, %rsi",
...
CLEAR_BHB_LOOP_SEQ
Or, even global variables:
mov outer_loop_count(%rip), %rdi
mov inner_loop_count(%rip), %rsi
and then have some C code somewhere that does:
if (cpu_feature_enabled(X86_FEATURE_BHI_CTRL)) {
outer_loop_count = 5;
inner_loop_count = 5;
} else {
outer_loop_count = 12;
inner_loop_count = 7;
}
... and I'm sure I got something wrong in there like flipping the
inner/outer counts, and I'm not even thinking about the variable types.
But, basically, I think I want to avoid as much logic as possible in
assembly. I also think we should reserve ALTERNATIVE for things that
truly need it, like things that are truly performance sensitive or that
can't reach out and poke at variables.
Peter Z. usually has good instincts on these things, so I'm curious what
he thinks of all this.
* Re: [PATCH v4 04/11] x86/bhi: Make clear_bhb_loop() effective on newer CPUs
2025-11-21 16:40 ` Dave Hansen
@ 2025-11-21 16:45 ` Nikolay Borisov
2025-11-21 16:50 ` Dave Hansen
2025-11-26 19:23 ` Pawan Gupta
1 sibling, 1 reply; 63+ messages in thread
From: Nikolay Borisov @ 2025-11-21 16:45 UTC (permalink / raw)
To: Dave Hansen, Pawan Gupta, x86, David Kaplan, H. Peter Anvin,
Josh Poimboeuf, Sean Christopherson, Paolo Bonzini,
Borislav Petkov, Dave Hansen
Cc: linux-kernel, kvm, Asit Mallick, Tao Zhang, Peter Zijlstra
On 11/21/25 18:40, Dave Hansen wrote:
> On 11/19/25 22:18, Pawan Gupta wrote:
>> - CLEAR_BHB_LOOP_SEQ 5, 5
>> + /* loop count differs based on CPU-gen, see Intel's BHI guidance */
>> + ALTERNATIVE (CLEAR_BHB_LOOP_SEQ 5, 5), \
>> + __stringify(CLEAR_BHB_LOOP_SEQ 12, 7), X86_FEATURE_BHI_CTRL
>
> There are a million ways to skin this cat. But I'm not sure I really
> like the end result here. It seems a little overkill to use ALTERNATIVE
> to rewrite a whole sequence just to patch two constants in there.
>
> What if the CLEAR_BHB_LOOP_SEQ just took its inner and outer loop counts
> as register arguments? Then this would look more like:
>
> ALTERNATIVE "mov $5, %rdi; mov $5, %rsi",
> "mov $12, %rdi; mov $7, %rsi",
> ...
>
> CLEAR_BHB_LOOP_SEQ
>
> Or, even global variables:
>
> mov outer_loop_count(%rip), %rdi
> mov inner_loop_count(%rip), %rsi
nit: FWIW I find this rather tacky, because the way the registers are
being used (although they do follow the x86-64 calling convention) is
obfuscated in the macro itself.
>
> and then have some C code somewhere that does:
>
> if (cpu_feature_enabled(X86_FEATURE_BHI_CTRL)) {
> outer_loop_count = 5;
> inner_loop_count = 5;
> } else {
> outer_loop_count = 12;
> inner_loop_count = 7;
> }
OTOH: the global variable approach seems saner as in the macro you'd
have a direct reference to them and so it will be more obvious how things
are set up.
<snip>
* Re: [PATCH v4 04/11] x86/bhi: Make clear_bhb_loop() effective on newer CPUs
2025-11-21 16:45 ` Nikolay Borisov
@ 2025-11-21 16:50 ` Dave Hansen
2025-11-21 18:16 ` Pawan Gupta
0 siblings, 1 reply; 63+ messages in thread
From: Dave Hansen @ 2025-11-21 16:50 UTC (permalink / raw)
To: Nikolay Borisov, Pawan Gupta, x86, David Kaplan, H. Peter Anvin,
Josh Poimboeuf, Sean Christopherson, Paolo Bonzini,
Borislav Petkov, Dave Hansen
Cc: linux-kernel, kvm, Asit Mallick, Tao Zhang, Peter Zijlstra
On 11/21/25 08:45, Nikolay Borisov wrote:
> OTOH: the global variable approach seems saner as in the macro you'd
> have direct reference to them and so it will be more obvious how things
> are setup.
Oh, yeah, duh. You don't need to pass the variables in registers. They
could just be read directly.
* Re: [PATCH v4 10/11] x86/vmscape: Override conflicting attack-vector controls with =force
2025-11-20 6:20 ` [PATCH v4 10/11] x86/vmscape: Override conflicting attack-vector controls with =force Pawan Gupta
@ 2025-11-21 18:04 ` Nikolay Borisov
0 siblings, 0 replies; 63+ messages in thread
From: Nikolay Borisov @ 2025-11-21 18:04 UTC (permalink / raw)
To: Pawan Gupta, x86, David Kaplan, H. Peter Anvin, Josh Poimboeuf,
Sean Christopherson, Paolo Bonzini, Borislav Petkov, Dave Hansen
Cc: linux-kernel, kvm, Asit Mallick, Tao Zhang
On 11/20/25 08:20, Pawan Gupta wrote:
> vmscape=force option currently defaults to AUTO mitigation. This is not
> correct because attack-vector controls override a mitigation when in AUTO
> mode. This prevents a user from being able to force VMSCAPE mitigation when
> it conflicts with attack-vector controls.
>
> Kernel should deploy a forced mitigation irrespective of attack vectors.
> Instead of AUTO, use VMSCAPE_MITIGATION_ON that wins over attack-vector
> controls.
>
> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
* Re: [PATCH v4 04/11] x86/bhi: Make clear_bhb_loop() effective on newer CPUs
2025-11-21 16:50 ` Dave Hansen
@ 2025-11-21 18:16 ` Pawan Gupta
2025-11-21 18:42 ` Dave Hansen
0 siblings, 1 reply; 63+ messages in thread
From: Pawan Gupta @ 2025-11-21 18:16 UTC (permalink / raw)
To: Dave Hansen
Cc: Nikolay Borisov, x86, David Kaplan, H. Peter Anvin,
Josh Poimboeuf, Sean Christopherson, Paolo Bonzini,
Borislav Petkov, Dave Hansen, linux-kernel, kvm, Asit Mallick,
Tao Zhang, Peter Zijlstra
On Fri, Nov 21, 2025 at 08:50:17AM -0800, Dave Hansen wrote:
> On 11/21/25 08:45, Nikolay Borisov wrote:
> > OTOH: the global variable approach seems saner as in the macro you'd
> > have direct reference to them and so it will be more obvious how things
> > are setup.
>
> Oh, yeah, duh. You don't need to pass the variables in registers. They
> could just be read directly.
IIUC, global variables would introduce extra memory loads that may slow
things down. I will try to measure their impact. I think those global
variables should be in the .entry.text section to play well with PTI.
Also, I was preferring constants because loads from global variables may
also be subject to speculation. Although, any speculation should be
corrected before an indirect branch is executed, because of the LFENCE
after the sequence.
* Re: [PATCH v4 09/11] x86/vmscape: Deploy BHB clearing mitigation
2025-11-21 14:18 ` Nikolay Borisov
@ 2025-11-21 18:29 ` Pawan Gupta
0 siblings, 0 replies; 63+ messages in thread
From: Pawan Gupta @ 2025-11-21 18:29 UTC (permalink / raw)
To: Nikolay Borisov
Cc: x86, David Kaplan, H. Peter Anvin, Josh Poimboeuf,
Sean Christopherson, Paolo Bonzini, Borislav Petkov, Dave Hansen,
linux-kernel, kvm, Asit Mallick, Tao Zhang
On Fri, Nov 21, 2025 at 04:18:09PM +0200, Nikolay Borisov wrote:
>
>
> On 11/20/25 08:19, Pawan Gupta wrote:
> > IBPB mitigation for VMSCAPE is an overkill on CPUs that are only affected
> > by the BHI variant of VMSCAPE. On such CPUs, eIBRS already provides
> > indirect branch isolation between guest and host userspace. However, branch
> > history from guest may also influence the indirect branches in host
> > userspace.
> >
> > To mitigate the BHI aspect, use clear_bhb_loop().
> >
> > Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
>
> <snip>
>
> > @@ -3278,6 +3290,9 @@ static void __init vmscape_apply_mitigation(void)
> > {
> > if (vmscape_mitigation == VMSCAPE_MITIGATION_IBPB_EXIT_TO_USER)
> > static_call_update(vmscape_predictor_flush, write_ibpb);
> > + else if (vmscape_mitigation == VMSCAPE_MITIGATION_BHB_CLEAR_EXIT_TO_USER &&
> > + IS_ENABLED(CONFIG_X86_64))
>
> why the x86_64 dependency ?
The BHI clearing sequence is only supported in 64-bit mode. I will add a
comment. Looking at it again, I realized that the 64-bit check should be
in vmscape_select_mitigation(); otherwise we report incorrectly on 32-bit.
> > + static_call_update(vmscape_predictor_flush, clear_bhb_loop);
> > }
> > #undef pr_fmt
> > @@ -3369,6 +3384,7 @@ void cpu_bugs_smt_update(void)
> > break;
> > case VMSCAPE_MITIGATION_IBPB_ON_VMEXIT:
> > case VMSCAPE_MITIGATION_IBPB_EXIT_TO_USER:
> > + case VMSCAPE_MITIGATION_BHB_CLEAR_EXIT_TO_USER:
> > /*
> > * Hypervisors can be attacked across-threads, warn for SMT when
> > * STIBP is not already enabled system-wide.
> >
>
* Re: [PATCH v4 09/11] x86/vmscape: Deploy BHB clearing mitigation
2025-11-21 14:23 ` Nikolay Borisov
@ 2025-11-21 18:41 ` Pawan Gupta
2025-11-21 18:53 ` Nikolay Borisov
0 siblings, 1 reply; 63+ messages in thread
From: Pawan Gupta @ 2025-11-21 18:41 UTC (permalink / raw)
To: Nikolay Borisov
Cc: x86, David Kaplan, H. Peter Anvin, Josh Poimboeuf,
Sean Christopherson, Paolo Bonzini, Borislav Petkov, Dave Hansen,
linux-kernel, kvm, Asit Mallick, Tao Zhang
On Fri, Nov 21, 2025 at 04:23:56PM +0200, Nikolay Borisov wrote:
>
>
> On 11/20/25 08:19, Pawan Gupta wrote:
> > IBPB mitigation for VMSCAPE is an overkill on CPUs that are only affected
> > by the BHI variant of VMSCAPE. On such CPUs, eIBRS already provides
> > indirect branch isolation between guest and host userspace. However, branch
> > history from guest may also influence the indirect branches in host
> > userspace.
> >
> > To mitigate the BHI aspect, use clear_bhb_loop().
> >
> > Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
> > ---
> > Documentation/admin-guide/hw-vuln/vmscape.rst | 4 ++++
> > arch/x86/include/asm/nospec-branch.h | 2 ++
> > arch/x86/kernel/cpu/bugs.c | 30 ++++++++++++++++++++-------
> > 3 files changed, 29 insertions(+), 7 deletions(-)
> >
> > diff --git a/Documentation/admin-guide/hw-vuln/vmscape.rst b/Documentation/admin-guide/hw-vuln/vmscape.rst
> > index d9b9a2b6c114c05a7325e5f3c9d42129339b870b..dc63a0bac03d43d1e295de0791dd6497d101f986 100644
> > --- a/Documentation/admin-guide/hw-vuln/vmscape.rst
> > +++ b/Documentation/admin-guide/hw-vuln/vmscape.rst
> > @@ -86,6 +86,10 @@ The possible values in this file are:
> > run a potentially malicious guest and issues an IBPB before the first
> > exit to userspace after VM-exit.
> > + * 'Mitigation: Clear BHB before exit to userspace':
> > +
> > + As above, conditional BHB clearing mitigation is enabled.
> > +
> > * 'Mitigation: IBPB on VMEXIT':
> > IBPB is issued on every VM-exit. This occurs when other mitigations like
> > diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
> > index 15a2fa8f2f48a066e102263513eff9537ac1d25f..1e8c26c37dbed4256b35101fb41c0e1eb6ef9272 100644
> > --- a/arch/x86/include/asm/nospec-branch.h
> > +++ b/arch/x86/include/asm/nospec-branch.h
> > @@ -388,6 +388,8 @@ extern void write_ibpb(void);
> > #ifdef CONFIG_X86_64
> > extern void clear_bhb_loop(void);
> > +#else
> > +static inline void clear_bhb_loop(void) {}
> > #endif
> > extern void (*x86_return_thunk)(void);
> > diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
> > index cbb3341b9a19f835738eda7226323d88b7e41e52..d12c07ccf59479ecf590935607394492c988b2ff 100644
> > --- a/arch/x86/kernel/cpu/bugs.c
> > +++ b/arch/x86/kernel/cpu/bugs.c
> > @@ -109,9 +109,8 @@ DEFINE_PER_CPU(u64, x86_spec_ctrl_current);
> > EXPORT_PER_CPU_SYMBOL_GPL(x86_spec_ctrl_current);
> > /*
> > - * Set when the CPU has run a potentially malicious guest. An IBPB will
> > - * be needed to before running userspace. That IBPB will flush the branch
> > - * predictor content.
> > + * Set when the CPU has run a potentially malicious guest. Indicates that a
> > + * branch predictor flush is needed before running userspace.
> > */
> > DEFINE_PER_CPU(bool, x86_predictor_flush_exit_to_user);
> > EXPORT_PER_CPU_SYMBOL_GPL(x86_predictor_flush_exit_to_user);
> > @@ -3200,13 +3199,15 @@ enum vmscape_mitigations {
> > VMSCAPE_MITIGATION_AUTO,
> > VMSCAPE_MITIGATION_IBPB_EXIT_TO_USER,
> > VMSCAPE_MITIGATION_IBPB_ON_VMEXIT,
> > + VMSCAPE_MITIGATION_BHB_CLEAR_EXIT_TO_USER,
> > };
> > static const char * const vmscape_strings[] = {
> > - [VMSCAPE_MITIGATION_NONE] = "Vulnerable",
> > + [VMSCAPE_MITIGATION_NONE] = "Vulnerable",
> > /* [VMSCAPE_MITIGATION_AUTO] */
> > - [VMSCAPE_MITIGATION_IBPB_EXIT_TO_USER] = "Mitigation: IBPB before exit to userspace",
> > - [VMSCAPE_MITIGATION_IBPB_ON_VMEXIT] = "Mitigation: IBPB on VMEXIT",
> > + [VMSCAPE_MITIGATION_IBPB_EXIT_TO_USER] = "Mitigation: IBPB before exit to userspace",
> > + [VMSCAPE_MITIGATION_IBPB_ON_VMEXIT] = "Mitigation: IBPB on VMEXIT",
> > + [VMSCAPE_MITIGATION_BHB_CLEAR_EXIT_TO_USER] = "Mitigation: Clear BHB before exit to userspace",
> > };
> > static enum vmscape_mitigations vmscape_mitigation __ro_after_init =
> > @@ -3253,8 +3254,19 @@ static void __init vmscape_select_mitigation(void)
> > vmscape_mitigation = VMSCAPE_MITIGATION_NONE;
> > break;
> > + case VMSCAPE_MITIGATION_BHB_CLEAR_EXIT_TO_USER:
> > + if (!boot_cpu_has(X86_FEATURE_BHI_CTRL))
> > + vmscape_mitigation = VMSCAPE_MITIGATION_NONE;
> > + break;
>
> Am I missing something, or can this case never execute, because
> VMSCAPE_MITIGATION_BHB_CLEAR_EXIT_TO_USER is only ever set when the
> mitigation is VMSCAPE_MITIGATION_AUTO in the branch below? Perhaps just
> remove it? This just shows how confusing the logic for choosing the
> mitigations has become....
The goal was to not make any assumptions about what vmscape_parse_cmdline()
can and cannot set. If you feel strongly about it, I can remove this case.
> > case VMSCAPE_MITIGATION_AUTO:
> > - if (boot_cpu_has(X86_FEATURE_IBPB))
> > + /*
> > + * CPUs with BHI_CTRL (ADL and newer) can avoid the IBPB and use the
> > + * BHB clear sequence. These CPUs are only vulnerable to the BHI
> > + * variant of the VMSCAPE attack and do not require an IBPB flush.
> > + */
> > + if (boot_cpu_has(X86_FEATURE_BHI_CTRL))
> > + vmscape_mitigation = VMSCAPE_MITIGATION_BHB_CLEAR_EXIT_TO_USER;
> > + else if (boot_cpu_has(X86_FEATURE_IBPB))
> > vmscape_mitigation = VMSCAPE_MITIGATION_IBPB_EXIT_TO_USER;
> > else
> > vmscape_mitigation = VMSCAPE_MITIGATION_NONE;
>
>
> <snip>
>
* Re: [PATCH v4 04/11] x86/bhi: Make clear_bhb_loop() effective on newer CPUs
2025-11-21 18:16 ` Pawan Gupta
@ 2025-11-21 18:42 ` Dave Hansen
2025-11-21 21:26 ` Pawan Gupta
0 siblings, 1 reply; 63+ messages in thread
From: Dave Hansen @ 2025-11-21 18:42 UTC (permalink / raw)
To: Pawan Gupta
Cc: Nikolay Borisov, x86, David Kaplan, H. Peter Anvin,
Josh Poimboeuf, Sean Christopherson, Paolo Bonzini,
Borislav Petkov, Dave Hansen, linux-kernel, kvm, Asit Mallick,
Tao Zhang, Peter Zijlstra
On 11/21/25 10:16, Pawan Gupta wrote:
> On Fri, Nov 21, 2025 at 08:50:17AM -0800, Dave Hansen wrote:
>> On 11/21/25 08:45, Nikolay Borisov wrote:
>>> OTOH: the global variable approach seems saner as in the macro you'd
>>> have direct reference to them and so it will be more obvious how things
>>> are setup.
>>
>> Oh, yeah, duh. You don't need to pass the variables in registers. They
>> could just be read directly.
>
> IIUC, global variables would introduce extra memory loads that may slow
> things down. I will try to measure their impact. I think those global
> variables should be in the .entry.text section to play well with PTI.
Really? I didn't look exhaustively, but CLEAR_BRANCH_HISTORY seems to
get called pretty close to where the assembly jumps into C. Long after
we're running on the kernel CR3.
> Also I was preferring constants because load values from global variables
> may also be subject to speculation. Although any speculation should be
> corrected before an indirect branch is executed because of the LFENCE after
> the sequence.
I guess that's a theoretical problem, but it's not a practical one.
So I think we have 4-ish options at this point:
1. Generate the long and short sequences independently and in their
entirety and ALTERNATIVE between them (the original patch)
2. Store the inner/outer loop counts in registers and:
2a. Load those registers from variables
2b. Load them from ALTERNATIVES
3. Store the inner/outer loop counts in variables in memory
* Re: [PATCH v4 09/11] x86/vmscape: Deploy BHB clearing mitigation
2025-11-21 18:41 ` Pawan Gupta
@ 2025-11-21 18:53 ` Nikolay Borisov
2025-11-21 21:29 ` Pawan Gupta
0 siblings, 1 reply; 63+ messages in thread
From: Nikolay Borisov @ 2025-11-21 18:53 UTC (permalink / raw)
To: Pawan Gupta
Cc: x86, David Kaplan, H. Peter Anvin, Josh Poimboeuf,
Sean Christopherson, Paolo Bonzini, Borislav Petkov, Dave Hansen,
linux-kernel, kvm, Asit Mallick, Tao Zhang
On 11/21/25 20:41, Pawan Gupta wrote:
> On Fri, Nov 21, 2025 at 04:23:56PM +0200, Nikolay Borisov wrote:
>>
>>
>> On 11/20/25 08:19, Pawan Gupta wrote:
>>> IBPB mitigation for VMSCAPE is an overkill on CPUs that are only affected
>>> by the BHI variant of VMSCAPE. On such CPUs, eIBRS already provides
>>> indirect branch isolation between guest and host userspace. However, branch
>>> history from guest may also influence the indirect branches in host
>>> userspace.
>>>
>>> To mitigate the BHI aspect, use clear_bhb_loop().
>>>
>>> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
>>> ---
>>> Documentation/admin-guide/hw-vuln/vmscape.rst | 4 ++++
>>> arch/x86/include/asm/nospec-branch.h | 2 ++
>>> arch/x86/kernel/cpu/bugs.c | 30 ++++++++++++++++++++-------
>>> 3 files changed, 29 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/Documentation/admin-guide/hw-vuln/vmscape.rst b/Documentation/admin-guide/hw-vuln/vmscape.rst
>>> index d9b9a2b6c114c05a7325e5f3c9d42129339b870b..dc63a0bac03d43d1e295de0791dd6497d101f986 100644
>>> --- a/Documentation/admin-guide/hw-vuln/vmscape.rst
>>> +++ b/Documentation/admin-guide/hw-vuln/vmscape.rst
>>> @@ -86,6 +86,10 @@ The possible values in this file are:
>>> run a potentially malicious guest and issues an IBPB before the first
>>> exit to userspace after VM-exit.
>>> + * 'Mitigation: Clear BHB before exit to userspace':
>>> +
>>> + As above, conditional BHB clearing mitigation is enabled.
>>> +
>>> * 'Mitigation: IBPB on VMEXIT':
>>> IBPB is issued on every VM-exit. This occurs when other mitigations like
>>> diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
>>> index 15a2fa8f2f48a066e102263513eff9537ac1d25f..1e8c26c37dbed4256b35101fb41c0e1eb6ef9272 100644
>>> --- a/arch/x86/include/asm/nospec-branch.h
>>> +++ b/arch/x86/include/asm/nospec-branch.h
>>> @@ -388,6 +388,8 @@ extern void write_ibpb(void);
>>> #ifdef CONFIG_X86_64
>>> extern void clear_bhb_loop(void);
>>> +#else
>>> +static inline void clear_bhb_loop(void) {}
>>> #endif
>>> extern void (*x86_return_thunk)(void);
>>> diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
>>> index cbb3341b9a19f835738eda7226323d88b7e41e52..d12c07ccf59479ecf590935607394492c988b2ff 100644
>>> --- a/arch/x86/kernel/cpu/bugs.c
>>> +++ b/arch/x86/kernel/cpu/bugs.c
>>> @@ -109,9 +109,8 @@ DEFINE_PER_CPU(u64, x86_spec_ctrl_current);
>>> EXPORT_PER_CPU_SYMBOL_GPL(x86_spec_ctrl_current);
>>> /*
>>> - * Set when the CPU has run a potentially malicious guest. An IBPB will
>>> - * be needed to before running userspace. That IBPB will flush the branch
>>> - * predictor content.
>>> + * Set when the CPU has run a potentially malicious guest. Indicates that a
>>> + * branch predictor flush is needed before running userspace.
>>> */
>>> DEFINE_PER_CPU(bool, x86_predictor_flush_exit_to_user);
>>> EXPORT_PER_CPU_SYMBOL_GPL(x86_predictor_flush_exit_to_user);
>>> @@ -3200,13 +3199,15 @@ enum vmscape_mitigations {
>>> VMSCAPE_MITIGATION_AUTO,
>>> VMSCAPE_MITIGATION_IBPB_EXIT_TO_USER,
>>> VMSCAPE_MITIGATION_IBPB_ON_VMEXIT,
>>> + VMSCAPE_MITIGATION_BHB_CLEAR_EXIT_TO_USER,
>>> };
>>> static const char * const vmscape_strings[] = {
>>> - [VMSCAPE_MITIGATION_NONE] = "Vulnerable",
>>> + [VMSCAPE_MITIGATION_NONE] = "Vulnerable",
>>> /* [VMSCAPE_MITIGATION_AUTO] */
>>> - [VMSCAPE_MITIGATION_IBPB_EXIT_TO_USER] = "Mitigation: IBPB before exit to userspace",
>>> - [VMSCAPE_MITIGATION_IBPB_ON_VMEXIT] = "Mitigation: IBPB on VMEXIT",
>>> + [VMSCAPE_MITIGATION_IBPB_EXIT_TO_USER] = "Mitigation: IBPB before exit to userspace",
>>> + [VMSCAPE_MITIGATION_IBPB_ON_VMEXIT] = "Mitigation: IBPB on VMEXIT",
>>> + [VMSCAPE_MITIGATION_BHB_CLEAR_EXIT_TO_USER] = "Mitigation: Clear BHB before exit to userspace",
>>> };
>>> static enum vmscape_mitigations vmscape_mitigation __ro_after_init =
>>> @@ -3253,8 +3254,19 @@ static void __init vmscape_select_mitigation(void)
>>> vmscape_mitigation = VMSCAPE_MITIGATION_NONE;
>>> break;
>>> + case VMSCAPE_MITIGATION_BHB_CLEAR_EXIT_TO_USER:
>>> + if (!boot_cpu_has(X86_FEATURE_BHI_CTRL))
>>> + vmscape_mitigation = VMSCAPE_MITIGATION_NONE;
>>> + break;
>>
>> Am I missing something, or can this case never execute, because
>> VMSCAPE_MITIGATION_BHB_CLEAR_EXIT_TO_USER is only ever set when the
>> mitigation is VMSCAPE_MITIGATION_AUTO in the branch below? Perhaps just
>> remove it? This just shows how confusing the logic for choosing the
>> mitigations has become....
>
> The goal was to not make any assumptions about what vmscape_parse_cmdline()
> can and cannot set. If you feel strongly about it, I can remove this case.
From where I'm standing, bugs.c is already rather hairy even after
multiple rounds of cleanups and brushups; if we can remove code, I'll
be up for it. At the very least, you can explicitly mention in the
commit message that you handle every case on principle and expect some
of it to be dead code. Still, I think the best code is the code that
doesn't exist, which you then never have to worry about.
<snip>
* Re: [PATCH v4 04/11] x86/bhi: Make clear_bhb_loop() effective on newer CPUs
2025-11-21 18:42 ` Dave Hansen
@ 2025-11-21 21:26 ` Pawan Gupta
2025-11-21 21:36 ` Dave Hansen
2025-11-22 11:05 ` david laight
0 siblings, 2 replies; 63+ messages in thread
From: Pawan Gupta @ 2025-11-21 21:26 UTC (permalink / raw)
To: Dave Hansen
Cc: Nikolay Borisov, x86, David Kaplan, H. Peter Anvin,
Josh Poimboeuf, Sean Christopherson, Paolo Bonzini,
Borislav Petkov, Dave Hansen, linux-kernel, kvm, Asit Mallick,
Tao Zhang, Peter Zijlstra
On Fri, Nov 21, 2025 at 10:42:24AM -0800, Dave Hansen wrote:
> On 11/21/25 10:16, Pawan Gupta wrote:
> > On Fri, Nov 21, 2025 at 08:50:17AM -0800, Dave Hansen wrote:
> >> On 11/21/25 08:45, Nikolay Borisov wrote:
> >>> OTOH: the global variable approach seems saner as in the macro you'd
> >>> have direct reference to them and so it will be more obvious how things
> >>> are setup.
> >>
> >> Oh, yeah, duh. You don't need to pass the variables in registers. They
> >> could just be read directly.
> >
> > IIUC, global variables would introduce extra memory loads that may slow
> > things down. I will try to measure their impact. I think those global
> > variables should be in the .entry.text section to play well with PTI.
>
> Really? I didn't look exhaustively, but CLEAR_BRANCH_HISTORY seems to
> get called pretty close to where the assembly jumps into C. Long after
> we're running on the kernel CR3.
You are right. PTI is not a concern here.
> > Also I was preferring constants because load values from global variables
> > may also be subject to speculation. Although any speculation should be
> > corrected before an indirect branch is executed because of the LFENCE after
> > the sequence.
>
> I guess that's a theoretical problem, but it's not a practical one.
Probably yes. But a load from memory would certainly be slower than
immediates.
> So I think we have 4-ish options at this point:
>
> 1. Generate the long and short sequences independently and in their
> entirety and ALTERNATIVE between them (the original patch)
> 2. Store the inner/outer loop counts in registers and:
> 2a. Load those registers from variables
> 2b. Load them from ALTERNATIVES
Both of these look to be good options to me.
2b. would be my first preference, because it keeps the loop counts as
inline constants. The resulting sequence stays the same as it is today.
> 3. Store the inner/outer loop counts in variables in memory
I could be wrong, but this will likely have a non-zero impact on
performance. I am afraid of causing any regressions in the BHI mitigation.
That is why I preferred the least invasive approach in my previous
attempts.
* Re: [PATCH v4 09/11] x86/vmscape: Deploy BHB clearing mitigation
2025-11-21 18:53 ` Nikolay Borisov
@ 2025-11-21 21:29 ` Pawan Gupta
0 siblings, 0 replies; 63+ messages in thread
From: Pawan Gupta @ 2025-11-21 21:29 UTC (permalink / raw)
To: Nikolay Borisov
Cc: x86, David Kaplan, H. Peter Anvin, Josh Poimboeuf,
Sean Christopherson, Paolo Bonzini, Borislav Petkov, Dave Hansen,
linux-kernel, kvm, Asit Mallick, Tao Zhang
On Fri, Nov 21, 2025 at 08:53:38PM +0200, Nikolay Borisov wrote:
>
>
> On 11/21/25 20:41, Pawan Gupta wrote:
> > On Fri, Nov 21, 2025 at 04:23:56PM +0200, Nikolay Borisov wrote:
> > >
> > >
> > > On 11/20/25 08:19, Pawan Gupta wrote:
> > > > IBPB mitigation for VMSCAPE is an overkill on CPUs that are only affected
> > > > by the BHI variant of VMSCAPE. On such CPUs, eIBRS already provides
> > > > indirect branch isolation between guest and host userspace. However, branch
> > > > history from guest may also influence the indirect branches in host
> > > > userspace.
> > > >
> > > > To mitigate the BHI aspect, use clear_bhb_loop().
> > > >
> > > > Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
> > > > ---
> > > > Documentation/admin-guide/hw-vuln/vmscape.rst | 4 ++++
> > > > arch/x86/include/asm/nospec-branch.h | 2 ++
> > > > arch/x86/kernel/cpu/bugs.c | 30 ++++++++++++++++++++-------
> > > > 3 files changed, 29 insertions(+), 7 deletions(-)
> > > >
> > > > diff --git a/Documentation/admin-guide/hw-vuln/vmscape.rst b/Documentation/admin-guide/hw-vuln/vmscape.rst
> > > > index d9b9a2b6c114c05a7325e5f3c9d42129339b870b..dc63a0bac03d43d1e295de0791dd6497d101f986 100644
> > > > --- a/Documentation/admin-guide/hw-vuln/vmscape.rst
> > > > +++ b/Documentation/admin-guide/hw-vuln/vmscape.rst
> > > > @@ -86,6 +86,10 @@ The possible values in this file are:
> > > > run a potentially malicious guest and issues an IBPB before the first
> > > > exit to userspace after VM-exit.
> > > > + * 'Mitigation: Clear BHB before exit to userspace':
> > > > +
> > > > + As above, conditional BHB clearing mitigation is enabled.
> > > > +
> > > > * 'Mitigation: IBPB on VMEXIT':
> > > > IBPB is issued on every VM-exit. This occurs when other mitigations like
> > > > diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
> > > > index 15a2fa8f2f48a066e102263513eff9537ac1d25f..1e8c26c37dbed4256b35101fb41c0e1eb6ef9272 100644
> > > > --- a/arch/x86/include/asm/nospec-branch.h
> > > > +++ b/arch/x86/include/asm/nospec-branch.h
> > > > @@ -388,6 +388,8 @@ extern void write_ibpb(void);
> > > > #ifdef CONFIG_X86_64
> > > > extern void clear_bhb_loop(void);
> > > > +#else
> > > > +static inline void clear_bhb_loop(void) {}
> > > > #endif
> > > > extern void (*x86_return_thunk)(void);
> > > > diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
> > > > index cbb3341b9a19f835738eda7226323d88b7e41e52..d12c07ccf59479ecf590935607394492c988b2ff 100644
> > > > --- a/arch/x86/kernel/cpu/bugs.c
> > > > +++ b/arch/x86/kernel/cpu/bugs.c
> > > > @@ -109,9 +109,8 @@ DEFINE_PER_CPU(u64, x86_spec_ctrl_current);
> > > > EXPORT_PER_CPU_SYMBOL_GPL(x86_spec_ctrl_current);
> > > > /*
> > > > - * Set when the CPU has run a potentially malicious guest. An IBPB will
> > > > - * be needed to before running userspace. That IBPB will flush the branch
> > > > - * predictor content.
> > > > + * Set when the CPU has run a potentially malicious guest. Indicates that a
> > > > + * branch predictor flush is needed before running userspace.
> > > > */
> > > > DEFINE_PER_CPU(bool, x86_predictor_flush_exit_to_user);
> > > > EXPORT_PER_CPU_SYMBOL_GPL(x86_predictor_flush_exit_to_user);
> > > > @@ -3200,13 +3199,15 @@ enum vmscape_mitigations {
> > > > VMSCAPE_MITIGATION_AUTO,
> > > > VMSCAPE_MITIGATION_IBPB_EXIT_TO_USER,
> > > > VMSCAPE_MITIGATION_IBPB_ON_VMEXIT,
> > > > + VMSCAPE_MITIGATION_BHB_CLEAR_EXIT_TO_USER,
> > > > };
> > > > static const char * const vmscape_strings[] = {
> > > > - [VMSCAPE_MITIGATION_NONE] = "Vulnerable",
> > > > + [VMSCAPE_MITIGATION_NONE] = "Vulnerable",
> > > > /* [VMSCAPE_MITIGATION_AUTO] */
> > > > - [VMSCAPE_MITIGATION_IBPB_EXIT_TO_USER] = "Mitigation: IBPB before exit to userspace",
> > > > - [VMSCAPE_MITIGATION_IBPB_ON_VMEXIT] = "Mitigation: IBPB on VMEXIT",
> > > > + [VMSCAPE_MITIGATION_IBPB_EXIT_TO_USER] = "Mitigation: IBPB before exit to userspace",
> > > > + [VMSCAPE_MITIGATION_IBPB_ON_VMEXIT] = "Mitigation: IBPB on VMEXIT",
> > > > + [VMSCAPE_MITIGATION_BHB_CLEAR_EXIT_TO_USER] = "Mitigation: Clear BHB before exit to userspace",
> > > > };
> > > > static enum vmscape_mitigations vmscape_mitigation __ro_after_init =
> > > > @@ -3253,8 +3254,19 @@ static void __init vmscape_select_mitigation(void)
> > > > vmscape_mitigation = VMSCAPE_MITIGATION_NONE;
> > > > break;
> > > > + case VMSCAPE_MITIGATION_BHB_CLEAR_EXIT_TO_USER:
> > > > + if (!boot_cpu_has(X86_FEATURE_BHI_CTRL))
> > > > + vmscape_mitigation = VMSCAPE_MITIGATION_NONE;
> > > > + break;
> > >
> > > Am I missing something, or can this case never execute, because
> > > VMSCAPE_MITIGATION_BHB_CLEAR_EXIT_TO_USER is only ever set in the branch
> > > below, when the mitigation is VMSCAPE_MITIGATION_AUTO? Perhaps just remove
> > > it? This just shows how confusing the logic for choosing the mitigations
> > > has become....
> >
> > The goal was not to make any assumptions about what vmscape_parse_cmdline()
> > can and cannot set. If you feel strongly about it, I can remove this case.
>
> From where I'm standing, bugs.c is already rather hairy even after multiple
> rounds of cleanups and brushups, so if we can remove code, I'm all for it.
> At the very least, you can explicitly mention in the commit message that you
> handle every case on principle and expect some of it to be dead code.
> Still, I think the best code is the code that doesn't exist, so you won't
> have to worry about it.
Makes sense. I will get rid of those cases.
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH v4 04/11] x86/bhi: Make clear_bhb_loop() effective on newer CPUs
2025-11-21 21:26 ` Pawan Gupta
@ 2025-11-21 21:36 ` Dave Hansen
2025-11-24 19:21 ` Pawan Gupta
2025-11-22 11:05 ` david laight
1 sibling, 1 reply; 63+ messages in thread
From: Dave Hansen @ 2025-11-21 21:36 UTC (permalink / raw)
To: Pawan Gupta
Cc: Nikolay Borisov, x86, David Kaplan, H. Peter Anvin,
Josh Poimboeuf, Sean Christopherson, Paolo Bonzini,
Borislav Petkov, Dave Hansen, linux-kernel, kvm, Asit Mallick,
Tao Zhang, Peter Zijlstra
On 11/21/25 13:26, Pawan Gupta wrote:
> On Fri, Nov 21, 2025 at 10:42:24AM -0800, Dave Hansen wrote:
>> On 11/21/25 10:16, Pawan Gupta wrote:
...>>> Also I was preferring constants because load values from global variables
>>> may also be subject to speculation. Although any speculation should be
>>> corrected before an indirect branch is executed because of the LFENCE after
>>> the sequence.
>>
>> I guess that's a theoretical problem, but it's not a practical one.
>
> Probably yes. But, load from memory would certainly be slower compared to
> immediates.
Yeah, but it's literally two bytes of data that can almost certainly be
shoved in a cacheline that's also being read on kernel entry. I suspect
it would be hard to show a delta between a memory load and an immediate.
I'd love to see some actual data.
>> So I think we have 4-ish options at this point:
>>
>> 1. Generate the long and short sequences independently and in their
>> entirety and ALTERNATIVE between them (the original patch)
>> 2. Store the inner/outer loop counts in registers and:
>> 2a. Load those registers from variables
>> 2b. Load them from ALTERNATIVES
>
> Both of these look to be good options to me.
>
> 2b. would be my first preference, because it keeps the loop counts as
> inline constants. The resulting sequence stays the same as it is today.
>
>> 3. Store the inner/outer loop counts in variables in memory
>
> I could be wrong, but this will likely have non-zero impact on performance.
> I am afraid to cause any regressions in BHI mitigation. That is why I
> preferred the least invasive approach in my previous attempts.
Your magic 8-ball and my crystal ball seem to be disagreeing today.
Time for science!
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH v4 04/11] x86/bhi: Make clear_bhb_loop() effective on newer CPUs
2025-11-21 21:26 ` Pawan Gupta
2025-11-21 21:36 ` Dave Hansen
@ 2025-11-22 11:05 ` david laight
2025-11-24 19:31 ` Pawan Gupta
1 sibling, 1 reply; 63+ messages in thread
From: david laight @ 2025-11-22 11:05 UTC (permalink / raw)
To: Pawan Gupta
Cc: Dave Hansen, Nikolay Borisov, x86, David Kaplan, H. Peter Anvin,
Josh Poimboeuf, Sean Christopherson, Paolo Bonzini,
Borislav Petkov, Dave Hansen, linux-kernel, kvm, Asit Mallick,
Tao Zhang, Peter Zijlstra
On Fri, 21 Nov 2025 13:26:27 -0800
Pawan Gupta <pawan.kumar.gupta@linux.intel.com> wrote:
> On Fri, Nov 21, 2025 at 10:42:24AM -0800, Dave Hansen wrote:
> > On 11/21/25 10:16, Pawan Gupta wrote:
> > > On Fri, Nov 21, 2025 at 08:50:17AM -0800, Dave Hansen wrote:
> > >> On 11/21/25 08:45, Nikolay Borisov wrote:
> > >>> OTOH: the global variable approach seems saner as in the macro you'd
> > >>> have direct reference to them and so it will be more obvious how things
> > >>> are setup.
> > >>
> > >> Oh, yeah, duh. You don't need to pass the variables in registers. They
> > >> could just be read directly.
> > >
> > > IIUC, global variables would introduce extra memory loads that may slow
> > > things down. I will try to measure their impact. I think those global
> > > variables should be in the .entry.text section to play well with PTI.
> >
> > Really? I didn't look exhaustively, but CLEAR_BRANCH_HISTORY seems to
> > get called pretty close to where the assembly jumps into C. Long after
> > we're running on the kernel CR3.
>
> You are right. PTI is not a concern here.
>
> > > Also I was preferring constants because load values from global variables
> > > may also be subject to speculation. Although any speculation should be
> > > corrected before an indirect branch is executed because of the LFENCE after
> > > the sequence.
> >
> > I guess that's a theoretical problem, but it's not a practical one.
>
> Probably yes. But, load from memory would certainly be slower compared to
> immediates.
>
> > So I think we have 4-ish options at this point:
> >
> > 1. Generate the long and short sequences independently and in their
> > entirety and ALTERNATIVE between them (the original patch)
> > 2. Store the inner/outer loop counts in registers and:
> > 2a. Load those registers from variables
> > 2b. Load them from ALTERNATIVES
>
> Both of these look to be good options to me.
>
> 2b. would be my first preference, because it keeps the loop counts as
> inline constants. The resulting sequence stays the same as it is today.
>
> > 3. Store the inner/outer loop counts in variables in memory
>
> I could be wrong, but this will likely have non-zero impact on performance.
> I am afraid to cause any regressions in BHI mitigation. That is why I
> preferred the least invasive approach in my previous attempts.
Surely it won't be significant compared to the cost of the loop itself.
That is the bit that really kills performance.
For subtle reasons one of the mitigations that slows kernel entry caused
a doubling of the execution time of a largely single-threaded task that
spends almost all its time in userspace!
(I thought I'd disabled it at compile time - but the config option
changed underneath me...)
David
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH v4 04/11] x86/bhi: Make clear_bhb_loop() effective on newer CPUs
2025-11-21 21:36 ` Dave Hansen
@ 2025-11-24 19:21 ` Pawan Gupta
0 siblings, 0 replies; 63+ messages in thread
From: Pawan Gupta @ 2025-11-24 19:21 UTC (permalink / raw)
To: Dave Hansen
Cc: Nikolay Borisov, x86, David Kaplan, H. Peter Anvin,
Josh Poimboeuf, Sean Christopherson, Paolo Bonzini,
Borislav Petkov, Dave Hansen, linux-kernel, kvm, Asit Mallick,
Tao Zhang, Peter Zijlstra
On Fri, Nov 21, 2025 at 01:36:37PM -0800, Dave Hansen wrote:
> On 11/21/25 13:26, Pawan Gupta wrote:
> > On Fri, Nov 21, 2025 at 10:42:24AM -0800, Dave Hansen wrote:
> >> On 11/21/25 10:16, Pawan Gupta wrote:
> ...>>> Also I was preferring constants because load values from global variables
> >>> may also be subject to speculation. Although any speculation should be
> >>> corrected before an indirect branch is executed because of the LFENCE after
> >>> the sequence.
> >>
> >> I guess that's a theoretical problem, but it's not a practical one.
> >
> > Probably yes. But, load from memory would certainly be slower compared to
> > immediates.
>
> Yeah, but it's literally two bytes of data that can almost certainly be
> shoved in a cacheline that's also being read on kernel entry. I suspect
> it would be hard to show a delta between a memory load and an immediate.
>
> I'd love to see some actual data.
You were right, the perf-tool profiling and the Unixbench results show no
meaningful difference between the two approaches. I was irrationally biased
towards immediates. I will make the loop counts global.
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH v4 04/11] x86/bhi: Make clear_bhb_loop() effective on newer CPUs
2025-11-22 11:05 ` david laight
@ 2025-11-24 19:31 ` Pawan Gupta
2025-11-25 11:34 ` david laight
0 siblings, 1 reply; 63+ messages in thread
From: Pawan Gupta @ 2025-11-24 19:31 UTC (permalink / raw)
To: david laight
Cc: Dave Hansen, Nikolay Borisov, x86, David Kaplan, H. Peter Anvin,
Josh Poimboeuf, Sean Christopherson, Paolo Bonzini,
Borislav Petkov, Dave Hansen, linux-kernel, kvm, Asit Mallick,
Tao Zhang, Peter Zijlstra
On Sat, Nov 22, 2025 at 11:05:58AM +0000, david laight wrote:
> On Fri, 21 Nov 2025 13:26:27 -0800
> Pawan Gupta <pawan.kumar.gupta@linux.intel.com> wrote:
>
> > On Fri, Nov 21, 2025 at 10:42:24AM -0800, Dave Hansen wrote:
> > > On 11/21/25 10:16, Pawan Gupta wrote:
> > > > On Fri, Nov 21, 2025 at 08:50:17AM -0800, Dave Hansen wrote:
> > > >> On 11/21/25 08:45, Nikolay Borisov wrote:
> > > >>> OTOH: the global variable approach seems saner as in the macro you'd
> > > >>> have direct reference to them and so it will be more obvious how things
> > > >>> are setup.
> > > >>
> > > >> Oh, yeah, duh. You don't need to pass the variables in registers. They
> > > >> could just be read directly.
> > > >
> > > > IIUC, global variables would introduce extra memory loads that may slow
> > > > things down. I will try to measure their impact. I think those global
> > > > variables should be in the .entry.text section to play well with PTI.
> > >
> > > Really? I didn't look exhaustively, but CLEAR_BRANCH_HISTORY seems to
> > > get called pretty close to where the assembly jumps into C. Long after
> > > we're running on the kernel CR3.
> >
> > You are right. PTI is not a concern here.
> >
> > > > Also I was preferring constants because load values from global variables
> > > > may also be subject to speculation. Although any speculation should be
> > > > corrected before an indirect branch is executed because of the LFENCE after
> > > > the sequence.
> > >
> > > I guess that's a theoretical problem, but it's not a practical one.
> >
> > Probably yes. But, load from memory would certainly be slower compared to
> > immediates.
> >
> > > So I think we have 4-ish options at this point:
> > >
> > > 1. Generate the long and short sequences independently and in their
> > > entirety and ALTERNATIVE between them (the original patch)
> > > 2. Store the inner/outer loop counts in registers and:
> > > 2a. Load those registers from variables
> > > 2b. Load them from ALTERNATIVES
> >
> > Both of these look to be good options to me.
> >
> > 2b. would be my first preference, because it keeps the loop counts as
> > inline constants. The resulting sequence stays the same as it is today.
> >
> > > 3. Store the inner/outer loop counts in variables in memory
> >
> > I could be wrong, but this will likely have non-zero impact on performance.
> > I am afraid to cause any regressions in BHI mitigation. That is why I
> > preferred the least invasive approach in my previous attempts.
>
> Surely it won't be significant compared to the cost of the loop itself.
> That is the bit that really kills performance.
Correct, recent data suggests the same.
> For subtle reasons one of the mitigations that slows kernel entry caused
> a doubling of the execution time of a largely single-threaded task that
> spends almost all its time in userspace!
> (I thought I'd disabled it at compile time - but the config option
> changed underneath me...)
That is surprising. If it's okay, could you please share more details about
this application, or any other way I can reproduce this?
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH v4 06/11] x86/vmscape: Move mitigation selection to a switch()
2025-11-21 14:27 ` Nikolay Borisov
@ 2025-11-24 23:09 ` Pawan Gupta
2025-11-25 10:19 ` Nikolay Borisov
0 siblings, 1 reply; 63+ messages in thread
From: Pawan Gupta @ 2025-11-24 23:09 UTC (permalink / raw)
To: Nikolay Borisov
Cc: x86, David Kaplan, H. Peter Anvin, Josh Poimboeuf,
Sean Christopherson, Paolo Bonzini, Borislav Petkov, Dave Hansen,
linux-kernel, kvm, Asit Mallick, Tao Zhang
On Fri, Nov 21, 2025 at 04:27:05PM +0200, Nikolay Borisov wrote:
>
>
> On 11/20/25 08:19, Pawan Gupta wrote:
> > This ensures that all mitigation modes are explicitly handled, while
> > keeping the mitigation selection for each mode together. This also prepares
> > for adding BHB-clearing mitigation mode for VMSCAPE.
> >
> > Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
> > ---
> > arch/x86/kernel/cpu/bugs.c | 22 ++++++++++++++++++----
> > 1 file changed, 18 insertions(+), 4 deletions(-)
> >
> > diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
> > index 1e9b11198db0fe2483bd17b1327bcfd44a2c1dbf..233594ede19bf971c999f4d3cc0f6f213002c16c 100644
> > --- a/arch/x86/kernel/cpu/bugs.c
> > +++ b/arch/x86/kernel/cpu/bugs.c
> > @@ -3231,17 +3231,31 @@ early_param("vmscape", vmscape_parse_cmdline);
> > static void __init vmscape_select_mitigation(void)
> > {
> > - if (!boot_cpu_has_bug(X86_BUG_VMSCAPE) ||
> > - !boot_cpu_has(X86_FEATURE_IBPB)) {
> > + if (!boot_cpu_has_bug(X86_BUG_VMSCAPE)) {
> > vmscape_mitigation = VMSCAPE_MITIGATION_NONE;
> > return;
> > }
> > - if (vmscape_mitigation == VMSCAPE_MITIGATION_AUTO) {
> > - if (should_mitigate_vuln(X86_BUG_VMSCAPE))
> > + if ((vmscape_mitigation == VMSCAPE_MITIGATION_AUTO) &&
> > + !should_mitigate_vuln(X86_BUG_VMSCAPE))
> > + vmscape_mitigation = VMSCAPE_MITIGATION_NONE;
> > +
> > + switch (vmscape_mitigation) {
> > + case VMSCAPE_MITIGATION_NONE:
> > + break;
> > +
> > + case VMSCAPE_MITIGATION_IBPB_ON_VMEXIT:
> > + case VMSCAPE_MITIGATION_IBPB_EXIT_TO_USER:
> > + if (!boot_cpu_has(X86_FEATURE_IBPB))
> > + vmscape_mitigation = VMSCAPE_MITIGATION_NONE;
> > + break;
> > +
> > + case VMSCAPE_MITIGATION_AUTO:
> > + if (boot_cpu_has(X86_FEATURE_IBPB))
> > vmscape_mitigation = VMSCAPE_MITIGATION_IBPB_EXIT_TO_USER;
>
>
> IMO this patch is a net-negative because as per my reply to patch 9 you have
> effectively a dead branch:
>
> The BHB_CLEAR_USER one; however, it turns out you have yet another one:
> VMSCAPE_MITIGATION_IBPB_ON_VMEXIT as it's only ever set in
> vmscape_update_mitigation() which executes after '_select()' as well and
Removed VMSCAPE_MITIGATION_IBPB_ON_VMEXIT.
> additionally you duplicate the FEATURE_IBPB check.
FEATURE_IBPB check is still needed for VMSCAPE_MITIGATION_IBPB_EXIT_TO_USER.
I don't think we can drop that.
> So I think either dropping it or removing the superfluous branches is in
> order.
>
> > else
> > vmscape_mitigation = VMSCAPE_MITIGATION_NONE;
> > + break;
> > }
> > }
> >
>
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH v4 02/11] x86/bhi: Move the BHB sequence to a macro for reuse
2025-11-20 6:18 ` [PATCH v4 02/11] x86/bhi: Move the BHB sequence to a macro for reuse Pawan Gupta
2025-11-20 16:28 ` Nikolay Borisov
@ 2025-11-25 0:21 ` Pawan Gupta
1 sibling, 0 replies; 63+ messages in thread
From: Pawan Gupta @ 2025-11-25 0:21 UTC (permalink / raw)
To: x86, David Kaplan, Nikolay Borisov, H. Peter Anvin,
Josh Poimboeuf, Sean Christopherson, Paolo Bonzini,
Borislav Petkov, Dave Hansen
Cc: linux-kernel, kvm, Asit Mallick, Tao Zhang
On Wed, Nov 19, 2025 at 10:18:04PM -0800, Pawan Gupta wrote:
> In preparation to make clear_bhb_loop() work for CPUs with larger BHB, move
> the sequence to a macro. This will allow setting the depth of BHB-clearing
> easily via arguments.
>
> No functional change intended.
>
> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
> ---
> arch/x86/entry/entry_64.S | 37 +++++++++++++++++++++++--------------
> 1 file changed, 23 insertions(+), 14 deletions(-)
>
> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> index 886f86790b4467347031bc27d3d761d5cc286da1..a62dbc89c5e75b955ebf6d84f20d157d4bce0253 100644
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -1499,11 +1499,6 @@ SYM_CODE_END(rewind_stack_and_make_dead)
> * from the branch history tracker in the Branch Predictor, therefore removing
> * user influence on subsequent BTB lookups.
> *
> - * It should be used on parts prior to Alder Lake. Newer parts should use the
> - * BHI_DIS_S hardware control instead. If a pre-Alder Lake part is being
> - * virtualized on newer hardware the VMM should protect against BHI attacks by
> - * setting BHI_DIS_S for the guests.
> - *
> * CALLs/RETs are necessary to prevent Loop Stream Detector(LSD) from engaging
> * and not clearing the branch history. The call tree looks like:
> *
> @@ -1532,10 +1527,7 @@ SYM_CODE_END(rewind_stack_and_make_dead)
> * Note, callers should use a speculation barrier like LFENCE immediately after
> * a call to this function to ensure BHB is cleared before indirect branches.
> */
> -SYM_FUNC_START(clear_bhb_loop)
> - ANNOTATE_NOENDBR
> - push %rbp
> - mov %rsp, %rbp
> +.macro CLEAR_BHB_LOOP_SEQ
> movl $5, %ecx
> ANNOTATE_INTRA_FUNCTION_CALL
> call 1f
> @@ -1545,15 +1537,16 @@ SYM_FUNC_START(clear_bhb_loop)
> * Shift instructions so that the RET is in the upper half of the
> * cacheline and don't take the slowpath to its_return_thunk.
> */
> - .skip 32 - (.Lret1 - 1f), 0xcc
> + .skip 32 - (.Lret1_\@ - 1f), 0xcc
> ANNOTATE_INTRA_FUNCTION_CALL
> 1: call 2f
> -.Lret1: RET
> +.Lret1_\@:
> + RET
> .align 64, 0xcc
> /*
> - * As above shift instructions for RET at .Lret2 as well.
> + * As above shift instructions for RET at .Lret2_\@ as well.
> *
> - * This should be ideally be: .skip 32 - (.Lret2 - 2f), 0xcc
> + * This should ideally be: .skip 32 - (.Lret2_\@ - 2f), 0xcc
> * but some Clang versions (e.g. 18) don't like this.
> */
> .skip 32 - 18, 0xcc
> @@ -1564,8 +1557,24 @@ SYM_FUNC_START(clear_bhb_loop)
> jnz 3b
> sub $1, %ecx
> jnz 1b
> -.Lret2: RET
> +.Lret2_\@:
> + RET
> 5:
> +.endm
> +
> +/*
> + * This should be used on parts prior to Alder Lake. Newer parts should use the
> + * BHI_DIS_S hardware control instead. If a pre-Alder Lake part is being
> + * virtualized on newer hardware the VMM should protect against BHI attacks by
> + * setting BHI_DIS_S for the guests.
> + */
> +SYM_FUNC_START(clear_bhb_loop)
> + ANNOTATE_NOENDBR
> + push %rbp
> + mov %rsp, %rbp
> +
> + CLEAR_BHB_LOOP_SEQ
> +
> pop %rbp
> RET
> SYM_FUNC_END(clear_bhb_loop)
Dropping this and the next patch; they are not needed with globals for the BHB
loop count.
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH v4 06/11] x86/vmscape: Move mitigation selection to a switch()
2025-11-24 23:09 ` Pawan Gupta
@ 2025-11-25 10:19 ` Nikolay Borisov
2025-11-25 17:45 ` Pawan Gupta
0 siblings, 1 reply; 63+ messages in thread
From: Nikolay Borisov @ 2025-11-25 10:19 UTC (permalink / raw)
To: Pawan Gupta
Cc: x86, David Kaplan, H. Peter Anvin, Josh Poimboeuf,
Sean Christopherson, Paolo Bonzini, Borislav Petkov, Dave Hansen,
linux-kernel, kvm, Asit Mallick, Tao Zhang
On 11/25/25 01:09, Pawan Gupta wrote:
> On Fri, Nov 21, 2025 at 04:27:05PM +0200, Nikolay Borisov wrote:
>>
>>
>> On 11/20/25 08:19, Pawan Gupta wrote:
>>> This ensures that all mitigation modes are explicitly handled, while
>>> keeping the mitigation selection for each mode together. This also prepares
>>> for adding BHB-clearing mitigation mode for VMSCAPE.
>>>
>>> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
>>> ---
>>> arch/x86/kernel/cpu/bugs.c | 22 ++++++++++++++++++----
>>> 1 file changed, 18 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
>>> index 1e9b11198db0fe2483bd17b1327bcfd44a2c1dbf..233594ede19bf971c999f4d3cc0f6f213002c16c 100644
>>> --- a/arch/x86/kernel/cpu/bugs.c
>>> +++ b/arch/x86/kernel/cpu/bugs.c
>>> @@ -3231,17 +3231,31 @@ early_param("vmscape", vmscape_parse_cmdline);
>>> static void __init vmscape_select_mitigation(void)
>>> {
>>> - if (!boot_cpu_has_bug(X86_BUG_VMSCAPE) ||
>>> - !boot_cpu_has(X86_FEATURE_IBPB)) {
>>> + if (!boot_cpu_has_bug(X86_BUG_VMSCAPE)) {
>>> vmscape_mitigation = VMSCAPE_MITIGATION_NONE;
>>> return;
>>> }
>>> - if (vmscape_mitigation == VMSCAPE_MITIGATION_AUTO) {
>>> - if (should_mitigate_vuln(X86_BUG_VMSCAPE))
>>> + if ((vmscape_mitigation == VMSCAPE_MITIGATION_AUTO) &&
>>> + !should_mitigate_vuln(X86_BUG_VMSCAPE))
>>> + vmscape_mitigation = VMSCAPE_MITIGATION_NONE;
>>> +
>>> + switch (vmscape_mitigation) {
>>> + case VMSCAPE_MITIGATION_NONE:
>>> + break;
>>> +
>>> + case VMSCAPE_MITIGATION_IBPB_ON_VMEXIT:
>>> + case VMSCAPE_MITIGATION_IBPB_EXIT_TO_USER:
>>> + if (!boot_cpu_has(X86_FEATURE_IBPB))
>>> + vmscape_mitigation = VMSCAPE_MITIGATION_NONE;
>>> + break;
>>> +
>>> + case VMSCAPE_MITIGATION_AUTO:
>>> + if (boot_cpu_has(X86_FEATURE_IBPB))
>>> vmscape_mitigation = VMSCAPE_MITIGATION_IBPB_EXIT_TO_USER;
>>
>>
>> IMO this patch is a net-negative because as per my reply to patch 9 you have
>> effectively a dead branch:
>>
>> The clear BHB_CLEAR_USER one, however it turns out you have yet another one:
>> VMSCAPE_MITIGATION_IBPB_ON_VMEXIT as it's only ever set in
>> vmscape_update_mitigation() which executes after '_select()' as well and
>
> Removed VMSCAPE_MITIGATION_IBPB_ON_VMEXIT.
>
>> additionally you duplicate the FEATURE_IBPB check.
>
> FEATURE_IBPB check is still needed for VMSCAPE_MITIGATION_IBPB_EXIT_TO_USER.
> I don't think we can drop that.
But if X86_FEATURE_IBPB is not present then all branches boil down to
setting the mitigation to NONE. What I was suggesting was not to remove
that check at the top.
>
>> So I think either dropping it or removing the superfluous branches is in
>> order.
>>
>>> else
>>> vmscape_mitigation = VMSCAPE_MITIGATION_NONE;
>>> + break;
>>> }
>>> }
>>>
>>
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH v4 04/11] x86/bhi: Make clear_bhb_loop() effective on newer CPUs
2025-11-24 19:31 ` Pawan Gupta
@ 2025-11-25 11:34 ` david laight
2025-12-04 1:40 ` Pawan Gupta
0 siblings, 1 reply; 63+ messages in thread
From: david laight @ 2025-11-25 11:34 UTC (permalink / raw)
To: Pawan Gupta
Cc: Dave Hansen, Nikolay Borisov, x86, David Kaplan, H. Peter Anvin,
Josh Poimboeuf, Sean Christopherson, Paolo Bonzini,
Borislav Petkov, Dave Hansen, linux-kernel, kvm, Asit Mallick,
Tao Zhang, Peter Zijlstra
On Mon, 24 Nov 2025 11:31:26 -0800
Pawan Gupta <pawan.kumar.gupta@linux.intel.com> wrote:
> On Sat, Nov 22, 2025 at 11:05:58AM +0000, david laight wrote:
...
> > For subtle reasons one of the mitigations that slows kernel entry caused
> > a doubling of the execution time of a largely single-threaded task that
> > spends almost all its time in userspace!
> > (I thought I'd disabled it at compile time - but the config option
> > changed underneath me...)
>
> That is surprising. If its okay, could you please share more details about
> this application? Or any other way I can reproduce this?
The 'trigger' program is a multi-threaded program that wakes up every 10ms
to process RTP and TDM audio data.
So we have a low RT priority process with one thread per cpu.
Since they are RT they usually get scheduled on the same cpu as last time.
I think this simple program will have the desired effect:
A main process that does:
syscall(SYS_clock_gettime, CLOCK_MONOTONIC, &start_time);
start_time += 1sec;
for (n = 1; n < num_cpu; n++)
pthread_create(thread_code, start_time);
thread_code(start_time);
with:
thread_code(ts)
{
	for (;;) {
		ts += 10ms;
		syscall(SYS_clock_nanosleep, CLOCK_MONOTONIC, TIMER_ABSTIME, &ts, NULL);
		do_work();
	}
}
So all the threads wake up at exactly the same time every 10ms.
(You need to use syscall(), don't look at what glibc does.)
On my system the program wasn't doing anything, so do_work() was empty.
What matters is whether all the threads end up running at the same time.
I managed that using pthread_broadcast(), but the clock code above
ought to be worse (and I've since changed the daemon to work that way
to avoid all the issues with pthread_broadcast() being sequential
and threads not running because the target cpu is running an ISR or
just looping in the kernel).
The process that gets 'hit' is anything cpu bound.
Even a shell loop (e.g. while :; do :; done) with a counter will do.
Without the 'trigger' program, it will (mostly) sit on one cpu and the
clock frequency of that cpu will increase to (say) 3GHz while the others
all run at 800MHz.
But the 'trigger' program runs threads on all the cpus at the same time.
So the 'hit' program is pre-empted and is later rescheduled on a
different cpu - running at 800MHz.
The cpu speed increases, but 10ms later it gets bounced again.
The real issue is that the cpu speed is associated with the cpu, not
with the process running on it.
David
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH v4 11/11] x86/vmscape: Add cmdline vmscape=on to override attack vector controls
2025-11-20 6:20 ` [PATCH v4 11/11] x86/vmscape: Add cmdline vmscape=on to override attack vector controls Pawan Gupta
@ 2025-11-25 11:41 ` Nikolay Borisov
0 siblings, 0 replies; 63+ messages in thread
From: Nikolay Borisov @ 2025-11-25 11:41 UTC (permalink / raw)
To: Pawan Gupta, x86, David Kaplan, H. Peter Anvin, Josh Poimboeuf,
Sean Christopherson, Paolo Bonzini, Borislav Petkov, Dave Hansen
Cc: linux-kernel, kvm, Asit Mallick, Tao Zhang
On 11/20/25 08:20, Pawan Gupta wrote:
> In general, individual mitigation controls can be used to override the
> attack vector controls. But, nothing exists to select BHB clearing
> mitigation for VMSCAPE. The =force option comes close, but with a
> side-effect of also forcibly setting the bug, hence deploying the
> mitigation on unaffected parts too.
>
> Add a new cmdline option vmscape=on to enable the mitigation based on the
> VMSCAPE variant the CPU is affected by.
>
> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH v4 06/11] x86/vmscape: Move mitigation selection to a switch()
2025-11-25 10:19 ` Nikolay Borisov
@ 2025-11-25 17:45 ` Pawan Gupta
0 siblings, 0 replies; 63+ messages in thread
From: Pawan Gupta @ 2025-11-25 17:45 UTC (permalink / raw)
To: Nikolay Borisov
Cc: x86, David Kaplan, H. Peter Anvin, Josh Poimboeuf,
Sean Christopherson, Paolo Bonzini, Borislav Petkov, Dave Hansen,
linux-kernel, kvm, Asit Mallick, Tao Zhang
On Tue, Nov 25, 2025 at 12:19:32PM +0200, Nikolay Borisov wrote:
> > FEATURE_IBPB check is still needed for VMSCAPE_MITIGATION_IBPB_EXIT_TO_USER.
> > I don't think we can drop that.
>
> But if X86_FEATURE_IBPB is not present then all branches boil down to
> setting the mitigation to NONE. What I was suggesting is to not remove the
> that check at the top.
BHB_CLEAR mitigation is still possible without IBPB, so the IBPB check cannot
be at the top. This patch prepares for adding BHB_CLEAR support.
Sure, I can delay moving the IBPB check to a later patch, but the intent of
splitting the patches was to keep the patch that moves the existing logic
separate from the one that adds a new mitigation.
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH v4 04/11] x86/bhi: Make clear_bhb_loop() effective on newer CPUs
2025-11-21 16:40 ` Dave Hansen
2025-11-21 16:45 ` Nikolay Borisov
@ 2025-11-26 19:23 ` Pawan Gupta
1 sibling, 0 replies; 63+ messages in thread
From: Pawan Gupta @ 2025-11-26 19:23 UTC (permalink / raw)
To: Dave Hansen
Cc: x86, David Kaplan, Nikolay Borisov, H. Peter Anvin,
Josh Poimboeuf, Sean Christopherson, Paolo Bonzini,
Borislav Petkov, Dave Hansen, linux-kernel, kvm, Asit Mallick,
Tao Zhang, Peter Zijlstra
On Fri, Nov 21, 2025 at 08:40:44AM -0800, Dave Hansen wrote:
> On 11/19/25 22:18, Pawan Gupta wrote:
> > - CLEAR_BHB_LOOP_SEQ 5, 5
> > + /* loop count differs based on CPU-gen, see Intel's BHI guidance */
> > + ALTERNATIVE (CLEAR_BHB_LOOP_SEQ 5, 5), \
> > + __stringify(CLEAR_BHB_LOOP_SEQ 12, 7), X86_FEATURE_BHI_CTRL
>
> There are a million ways to skin this cat. But I'm not sure I really
> like the end result here. It seems a little overkill to use ALTERNATIVE
> to rewrite a whole sequence just to patch two constants in there.
>
> What if the CLEAR_BHB_LOOP_SEQ just took its inner and outer loop counts
> as register arguments? Then this would look more like:
>
> ALTERNATIVE "mov $5, %rdi; mov $5, %rsi",
> "mov $12, %rdi; mov $7, %rsi",
> ...
>
> CLEAR_BHB_LOOP_SEQ
Following this idea, the loop counts can be set via ALTERNATIVE within
clear_bhb_loop() itself. The outer count %ecx is already set outside the
loops. The only change to the sequence would be to also store the inner
count in a register, and reload %eax from it.
---
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 886f86790b44..e4863d6d3217 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1536,7 +1536,11 @@ SYM_FUNC_START(clear_bhb_loop)
ANNOTATE_NOENDBR
push %rbp
mov %rsp, %rbp
- movl $5, %ecx
+
+ /* loop count differs based on BHI_CTRL, see Intel's BHI guidance */
+ ALTERNATIVE "movl $5, %ecx; movl $5, %edx;", \
+ "movl $12, %ecx; movl $7, %edx;", X86_FEATURE_BHI_CTRL
+
ANNOTATE_INTRA_FUNCTION_CALL
call 1f
jmp 5f
@@ -1557,7 +1561,7 @@ SYM_FUNC_START(clear_bhb_loop)
* but some Clang versions (e.g. 18) don't like this.
*/
.skip 32 - 18, 0xcc
-2: movl $5, %eax
+2: movl %edx, %eax
3: jmp 4f
nop
4: sub $1, %eax
^ permalink raw reply related [flat|nested] 63+ messages in thread
* Re: [PATCH v4 04/11] x86/bhi: Make clear_bhb_loop() effective on newer CPUs
2025-11-25 11:34 ` david laight
@ 2025-12-04 1:40 ` Pawan Gupta
2025-12-04 9:15 ` david laight
0 siblings, 1 reply; 63+ messages in thread
From: Pawan Gupta @ 2025-12-04 1:40 UTC (permalink / raw)
To: david laight
Cc: Dave Hansen, Nikolay Borisov, x86, David Kaplan, H. Peter Anvin,
Josh Poimboeuf, Sean Christopherson, Paolo Bonzini,
Borislav Petkov, Dave Hansen, linux-kernel, kvm, Asit Mallick,
Tao Zhang, Peter Zijlstra
On Tue, Nov 25, 2025 at 11:34:07AM +0000, david laight wrote:
> On Mon, 24 Nov 2025 11:31:26 -0800
> Pawan Gupta <pawan.kumar.gupta@linux.intel.com> wrote:
>
> > On Sat, Nov 22, 2025 at 11:05:58AM +0000, david laight wrote:
> ...
> > > For subtle reasons one of the mitigations that slows kernel entry caused
> > > a doubling of the execution time of a largely single-threaded task that
> > > spends almost all its time in userspace!
> > > (I thought I'd disabled it at compile time - but the config option
> > > changed underneath me...)
> >
> > That is surprising. If it's okay, could you please share more details about
> > this application? Or any other way I can reproduce this?
>
> The 'trigger' program is a multi-threaded program that wakes up every 10ms
> to process RTP and TDM audio data.
> So we have a low RT priority process with one thread per cpu.
> Since they are RT they usually get scheduled on the same cpu as last time.
> I think this simple program will have the desired effect:
> A main process that does:
> syscall(SYS_clock_gettime, CLOCK_MONOTONIC, &start_time);
> start_time += 1sec;
> for (n = 1; n < num_cpu; n++)
> pthread_create(thread_code, start_time);
> thread_code(start_time);
> with:
> thread_code(ts)
> {
> for (;;) {
> ts += 10ms;
> syscall(SYS_clock_nanosleep, CLOCK_MONOTONIC, TIMER_ABSTIME, &ts, NULL);
> do_work();
> }
>
> So all the threads wake up at exactly the same time every 10ms.
> (You need to use syscall(), don't look at what glibc does.)
>
> On my system the program wasn't doing anything, so do_work() was empty.
> What matters is whether all the threads end up running at the same time.
> I managed that using pthread_broadcast(), but the clock code above
> ought to be worse (and I've since changed the daemon to work that way
> to avoid all these issues with pthread_broadcast() being sequential
> and threads not running because the target cpu is running an ISR or
> just looping in kernel).
>
> The process that gets 'hit' is anything cpu bound.
> Even a shell loop (eg while :; do :; done) but with a counter will do.
>
> Without the 'trigger' program, it will (mostly) sit on one cpu and the
> clock frequency of that cpu will increase to (say) 3GHz while the others
> all run at 800MHz.
> But the 'trigger' program runs threads on all the cpu at the same time.
> So the 'hit' program is pre-empted and is later rescheduled on a
> different cpu - running at 800MHz.
> The cpu speed increases, but 10ms later it gets bounced again.
Sorry I haven't tried creating this test yet.
> The real issue is that the cpu speed is associated with the cpu, not
> the process running on it.
So if the 'hit' program gets scheduled to a CPU that is running at 3GHz
then we don't expect a dramatic performance drop? Setting scaling_governor
to "performance" would be an interesting test.
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH v4 04/11] x86/bhi: Make clear_bhb_loop() effective on newer CPUs
2025-12-04 1:40 ` Pawan Gupta
@ 2025-12-04 9:15 ` david laight
2025-12-04 21:56 ` Pawan Gupta
0 siblings, 1 reply; 63+ messages in thread
From: david laight @ 2025-12-04 9:15 UTC (permalink / raw)
To: Pawan Gupta
Cc: Dave Hansen, Nikolay Borisov, x86, David Kaplan, H. Peter Anvin,
Josh Poimboeuf, Sean Christopherson, Paolo Bonzini,
Borislav Petkov, Dave Hansen, linux-kernel, kvm, Asit Mallick,
Tao Zhang, Peter Zijlstra
On Wed, 3 Dec 2025 17:40:26 -0800
Pawan Gupta <pawan.kumar.gupta@linux.intel.com> wrote:
> On Tue, Nov 25, 2025 at 11:34:07AM +0000, david laight wrote:
> > On Mon, 24 Nov 2025 11:31:26 -0800
> > Pawan Gupta <pawan.kumar.gupta@linux.intel.com> wrote:
> >
> > > On Sat, Nov 22, 2025 at 11:05:58AM +0000, david laight wrote:
> > ...
> > > > For subtle reasons one of the mitigations that slows kernel entry caused
> > > > a doubling of the execution time of a largely single-threaded task that
> > > > spends almost all its time in userspace!
> > > > (I thought I'd disabled it at compile time - but the config option
> > > > changed underneath me...)
> > >
> > > That is surprising. If it's okay, could you please share more details about
> > > this application? Or any other way I can reproduce this?
> >
> > The 'trigger' program is a multi-threaded program that wakes up every 10ms
> > to process RTP and TDM audio data.
> > So we have a low RT priority process with one thread per cpu.
> > Since they are RT they usually get scheduled on the same cpu as last time.
> > I think this simple program will have the desired effect:
> > A main process that does:
> > syscall(SYS_clock_gettime, CLOCK_MONOTONIC, &start_time);
> > start_time += 1sec;
> > for (n = 1; n < num_cpu; n++)
> > pthread_create(thread_code, start_time);
> > thread_code(start_time);
> > with:
> > thread_code(ts)
> > {
> > for (;;) {
> > ts += 10ms;
> > syscall(SYS_clock_nanosleep, CLOCK_MONOTONIC, TIMER_ABSTIME, &ts, NULL);
> > do_work();
> > }
> >
> > So all the threads wake up at exactly the same time every 10ms.
> > (You need to use syscall(), don't look at what glibc does.)
> >
> > On my system the program wasn't doing anything, so do_work() was empty.
> > What matters is whether all the threads end up running at the same time.
> > I managed that using pthread_broadcast(), but the clock code above
> > ought to be worse (and I've since changed the daemon to work that way
> > to avoid all these issues with pthread_broadcast() being sequential
> > and threads not running because the target cpu is running an ISR or
> > just looping in kernel).
> >
> > The process that gets 'hit' is anything cpu bound.
> > Even a shell loop (eg while :; do :; done) but with a counter will do.
> >
> > Without the 'trigger' program, it will (mostly) sit on one cpu and the
> > clock frequency of that cpu will increase to (say) 3GHz while the others
> > all run at 800MHz.
> > But the 'trigger' program runs threads on all the cpu at the same time.
> > So the 'hit' program is pre-empted and is later rescheduled on a
> > different cpu - running at 800MHz.
> > The cpu speed increases, but 10ms later it gets bounced again.
>
> Sorry I haven't tried creating this test yet.
>
> > The real issue is that the cpu speed is associated with the cpu, not
> > the process running on it.
>
> So if the 'hit' program gets scheduled to a CPU that is running at 3GHz
> then we don't expect a dramatic performance drop? Setting scaling_governor
> to "performance" would be an interesting test.
I failed to find a way to lock the cpu frequency (for other testing) on
that system, an i7-7xxx - and the system will start thermally throttling
if you aren't careful.
ISTR that the hardware does most of the work.
So I'm not sure what difference "performance" makes (and can't remember what
might be set for that system - could be set anyway.)
We did have to disable some of the low power states; waking the cpu from those
just takes far too long.
David
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH v4 04/11] x86/bhi: Make clear_bhb_loop() effective on newer CPUs
2025-12-04 9:15 ` david laight
@ 2025-12-04 21:56 ` Pawan Gupta
2025-12-05 9:21 ` david laight
0 siblings, 1 reply; 63+ messages in thread
From: Pawan Gupta @ 2025-12-04 21:56 UTC (permalink / raw)
To: david laight
Cc: Dave Hansen, Nikolay Borisov, x86, David Kaplan, H. Peter Anvin,
Josh Poimboeuf, Sean Christopherson, Paolo Bonzini,
Borislav Petkov, Dave Hansen, linux-kernel, kvm, Asit Mallick,
Tao Zhang, Peter Zijlstra
On Thu, Dec 04, 2025 at 09:15:11AM +0000, david laight wrote:
> On Wed, 3 Dec 2025 17:40:26 -0800
> Pawan Gupta <pawan.kumar.gupta@linux.intel.com> wrote:
>
> > On Tue, Nov 25, 2025 at 11:34:07AM +0000, david laight wrote:
> > > On Mon, 24 Nov 2025 11:31:26 -0800
> > > Pawan Gupta <pawan.kumar.gupta@linux.intel.com> wrote:
> > >
> > > > On Sat, Nov 22, 2025 at 11:05:58AM +0000, david laight wrote:
> > > ...
> > > > > For subtle reasons one of the mitigations that slows kernel entry caused
> > > > > a doubling of the execution time of a largely single-threaded task that
> > > > > spends almost all its time in userspace!
> > > > > (I thought I'd disabled it at compile time - but the config option
> > > > > changed underneath me...)
> > > >
> > > > That is surprising. If it's okay, could you please share more details about
> > > > this application? Or any other way I can reproduce this?
> > >
> > > The 'trigger' program is a multi-threaded program that wakes up every 10ms
> > > to process RTP and TDM audio data.
> > > So we have a low RT priority process with one thread per cpu.
> > > Since they are RT they usually get scheduled on the same cpu as last time.
> > > I think this simple program will have the desired effect:
> > > A main process that does:
> > > syscall(SYS_clock_gettime, CLOCK_MONOTONIC, &start_time);
> > > start_time += 1sec;
> > > for (n = 1; n < num_cpu; n++)
> > > pthread_create(thread_code, start_time);
> > > thread_code(start_time);
> > > with:
> > > thread_code(ts)
> > > {
> > > for (;;) {
> > > ts += 10ms;
> > > syscall(SYS_clock_nanosleep, CLOCK_MONOTONIC, TIMER_ABSTIME, &ts, NULL);
> > > do_work();
> > > }
> > >
> > > So all the threads wake up at exactly the same time every 10ms.
> > > (You need to use syscall(), don't look at what glibc does.)
> > >
> > > On my system the program wasn't doing anything, so do_work() was empty.
> > > What matters is whether all the threads end up running at the same time.
> > > I managed that using pthread_broadcast(), but the clock code above
> > > ought to be worse (and I've since changed the daemon to work that way
> > > to avoid all these issues with pthread_broadcast() being sequential
> > > and threads not running because the target cpu is running an ISR or
> > > just looping in kernel).
> > >
> > > The process that gets 'hit' is anything cpu bound.
> > > Even a shell loop (eg while :; do :; done) but with a counter will do.
> > >
> > > Without the 'trigger' program, it will (mostly) sit on one cpu and the
> > > clock frequency of that cpu will increase to (say) 3GHz while the others
> > > all run at 800MHz.
> > > But the 'trigger' program runs threads on all the cpu at the same time.
> > > So the 'hit' program is pre-empted and is later rescheduled on a
> > > different cpu - running at 800MHz.
> > > The cpu speed increases, but 10ms later it gets bounced again.
> >
> > Sorry I haven't tried creating this test yet.
> >
> > > The real issue is that the cpu speed is associated with the cpu, not
> > > the process running on it.
> >
> > So if the 'hit' program gets scheduled to a CPU that is running at 3GHz
> > then we don't expect a dramatic performance drop? Setting scaling_governor
> > to "performance" would be an interesting test.
>
> I failed to find a way to lock the cpu frequency (for other testing) on
> that system an i7-7xxx - and the system will start thermally throttling
> if you aren't careful.
i7-7xxx would be Kaby Lake generation; those shouldn't need to deploy the
BHB-clear mitigation. I am guessing it is the legacy-IBRS mitigation in your case.
What you described looks very similar to the issue fixed by commit:
aa1567a7e644 ("intel_idle: Add ibrs_off module parameter to force-disable IBRS")
Commit bf5835bcdb96 ("intel_idle: Disable IBRS during long idle")
disables IBRS when the cstate is 6 or lower. However, there are
some use cases where a customer may want to use max_cstate=1 to
lower latency. Such use cases will suffer from the performance
degradation caused by the enabling of IBRS in the sibling idle thread.
Add a "ibrs_off" module parameter to force disable IBRS and the
CPUIDLE_FLAG_IRQ_ENABLE flag if set.
In the case of a Skylake server with max_cstate=1, this new ibrs_off
option will likely increase the IRQ response latency as IRQ will now
be disabled.
When running SPECjbb2015 with cstates set to C1 on a Skylake system.
First test when the kernel is booted with: "intel_idle.ibrs_off":
max-jOPS = 117828, critical-jOPS = 66047
Then retest when the kernel is booted without the "intel_idle.ibrs_off"
added:
max-jOPS = 116408, critical-jOPS = 58958
That means booting with "intel_idle.ibrs_off" improves performance by:
max-jOPS: +1.2%, which could be considered noise range.
critical-jOPS: +12%, which is definitely a solid improvement.
> ISTR that the hardware does most of the work.
> So I'm not sure what difference "performance" makes (and can't remember what
> might be set for that system - could be set anyway.)
> We did have to disable some of the low power states, waking the cpu from those
> just takes far too long.
Seems like you have a workaround in place already.
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH v4 04/11] x86/bhi: Make clear_bhb_loop() effective on newer CPUs
2025-12-04 21:56 ` Pawan Gupta
@ 2025-12-05 9:21 ` david laight
0 siblings, 0 replies; 63+ messages in thread
From: david laight @ 2025-12-05 9:21 UTC (permalink / raw)
To: Pawan Gupta
Cc: Dave Hansen, Nikolay Borisov, x86, David Kaplan, H. Peter Anvin,
Josh Poimboeuf, Sean Christopherson, Paolo Bonzini,
Borislav Petkov, Dave Hansen, linux-kernel, kvm, Asit Mallick,
Tao Zhang, Peter Zijlstra
On Thu, 4 Dec 2025 13:56:02 -0800
Pawan Gupta <pawan.kumar.gupta@linux.intel.com> wrote:
> On Thu, Dec 04, 2025 at 09:15:11AM +0000, david laight wrote:
> > On Wed, 3 Dec 2025 17:40:26 -0800
> > Pawan Gupta <pawan.kumar.gupta@linux.intel.com> wrote:
> >
> > > On Tue, Nov 25, 2025 at 11:34:07AM +0000, david laight wrote:
> > > > On Mon, 24 Nov 2025 11:31:26 -0800
> > > > Pawan Gupta <pawan.kumar.gupta@linux.intel.com> wrote:
> > > >
> > > > > On Sat, Nov 22, 2025 at 11:05:58AM +0000, david laight wrote:
> > > > ...
> > > > > > For subtle reasons one of the mitigations that slows kernel entry caused
> > > > > > a doubling of the execution time of a largely single-threaded task that
> > > > > > spends almost all its time in userspace!
> > > > > > (I thought I'd disabled it at compile time - but the config option
> > > > > > changed underneath me...)
> > > > >
> > > > > That is surprising. If it's okay, could you please share more details about
> > > > > this application? Or any other way I can reproduce this?
> > > >
> > > > The 'trigger' program is a multi-threaded program that wakes up every 10ms
> > > > to process RTP and TDM audio data.
> > > > So we have a low RT priority process with one thread per cpu.
> > > > Since they are RT they usually get scheduled on the same cpu as last time.
> > > > I think this simple program will have the desired effect:
> > > > A main process that does:
> > > > syscall(SYS_clock_gettime, CLOCK_MONOTONIC, &start_time);
> > > > start_time += 1sec;
> > > > for (n = 1; n < num_cpu; n++)
> > > > pthread_create(thread_code, start_time);
> > > > thread_code(start_time);
> > > > with:
> > > > thread_code(ts)
> > > > {
> > > > for (;;) {
> > > > ts += 10ms;
> > > > syscall(SYS_clock_nanosleep, CLOCK_MONOTONIC, TIMER_ABSTIME, &ts, NULL);
> > > > do_work();
> > > > }
> > > >
> > > > So all the threads wake up at exactly the same time every 10ms.
> > > > (You need to use syscall(), don't look at what glibc does.)
> > > >
> > > > On my system the program wasn't doing anything, so do_work() was empty.
> > > > What matters is whether all the threads end up running at the same time.
> > > > I managed that using pthread_broadcast(), but the clock code above
> > > > ought to be worse (and I've since changed the daemon to work that way
> > > > to avoid all these issues with pthread_broadcast() being sequential
> > > > and threads not running because the target cpu is running an ISR or
> > > > just looping in kernel).
> > > >
> > > > The process that gets 'hit' is anything cpu bound.
> > > > Even a shell loop (eg while :; do :; done) but with a counter will do.
> > > >
> > > > Without the 'trigger' program, it will (mostly) sit on one cpu and the
> > > > clock frequency of that cpu will increase to (say) 3GHz while the others
> > > > all run at 800MHz.
> > > > But the 'trigger' program runs threads on all the cpu at the same time.
> > > > So the 'hit' program is pre-empted and is later rescheduled on a
> > > > different cpu - running at 800MHz.
> > > > The cpu speed increases, but 10ms later it gets bounced again.
> > >
> > > Sorry I haven't tried creating this test yet.
> > >
> > > > The real issue is that the cpu speed is associated with the cpu, not
> > > > the process running on it.
> > >
> > > So if the 'hit' program gets scheduled to a CPU that is running at 3GHz
> > > then we don't expect a dramatic performance drop? Setting scaling_governor
> > > to "performance" would be an interesting test.
> >
> > I failed to find a way to lock the cpu frequency (for other testing) on
> > that system an i7-7xxx - and the system will start thermally throttling
> > if you aren't careful.
>
> i7-7xxx would be Kaby Lake gen, those shouldn't need to deploy BHB clear
> mitigation. I am guessing it is the legacy-IBRS mitigation in your case.
>
> What you described looks very similar to the issue fixed by commit:
>
> aa1567a7e644 ("intel_idle: Add ibrs_off module parameter to force-disable IBRS")
>
> Commit bf5835bcdb96 ("intel_idle: Disable IBRS during long idle")
> disables IBRS when the cstate is 6 or lower. However, there are
> some use cases where a customer may want to use max_cstate=1 to
> lower latency. Such use cases will suffer from the performance
> degradation caused by the enabling of IBRS in the sibling idle thread.
> Add a "ibrs_off" module parameter to force disable IBRS and the
> CPUIDLE_FLAG_IRQ_ENABLE flag if set.
>
> In the case of a Skylake server with max_cstate=1, this new ibrs_off
> option will likely increase the IRQ response latency as IRQ will now
> be disabled.
>
> When running SPECjbb2015 with cstates set to C1 on a Skylake system.
>
> First test when the kernel is booted with: "intel_idle.ibrs_off":
>
> max-jOPS = 117828, critical-jOPS = 66047
>
> Then retest when the kernel is booted without the "intel_idle.ibrs_off"
> added:
>
> max-jOPS = 116408, critical-jOPS = 58958
>
> That means booting with "intel_idle.ibrs_off" improves performance by:
>
> max-jOPS: +1.2%, which could be considered noise range.
> critical-jOPS: +12%, which is definitely a solid improvement.
No, it wasn't anything to do with sibling threads.
It was the simple issue of the single-threaded 'busy in userspace' program
getting migrated to an idle cpu running at a low frequency.
The IBRS mitigation just affected the timings of the other processes in the
system enough to force the user thread to be pre-empted and rescheduled.
So it was not directly related to this code - even though it caused it.
The real issue is the cpu speed being tied to the physical cpu, not the
thread running on it.
>
> > ISTR that the hardware does most of the work.
> > So I'm not sure what difference "performance" makes (and can't remember what
> > might be set for that system - could be set anyway.)
>
> > We did have to disable some of the low power states, waking the cpu from those
> > just takes far too long.
>
> Seems like you have a workaround in place already.
I just needed to find out why my fpga compile had gone up from 12 minutes
to over 20 with a kernel update.
Fixing that was easy, but the 'busy thread being migrated to an idle cpu'
is a separate issue that could affect a lot of workloads.
(Whether or not these mitigations are in place.)
Diagnosing it required looking at the scheduler ftrace events and then
realising what effect they would have.
It wouldn't surprise me if people have 'fixed' the problem by pinning
a process to a specific cpu; I couldn't try that because the fpga compiler
has some multithreaded parts.
David
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH v4 04/11] x86/bhi: Make clear_bhb_loop() effective on newer CPUs
2025-11-20 6:18 ` [PATCH v4 04/11] x86/bhi: Make clear_bhb_loop() effective on newer CPUs Pawan Gupta
2025-11-21 12:33 ` Nikolay Borisov
2025-11-21 16:40 ` Dave Hansen
@ 2026-03-06 21:00 ` Jim Mattson
2026-03-06 22:32 ` Pawan Gupta
2 siblings, 1 reply; 63+ messages in thread
From: Jim Mattson @ 2026-03-06 21:00 UTC (permalink / raw)
To: Pawan Gupta
Cc: x86, David Kaplan, Nikolay Borisov, H. Peter Anvin,
Josh Poimboeuf, Sean Christopherson, Paolo Bonzini,
Borislav Petkov, Dave Hansen, linux-kernel, kvm, Asit Mallick,
Tao Zhang, David Dunn
On Wed, Nov 19, 2025 at 10:19 PM Pawan Gupta
<pawan.kumar.gupta@linux.intel.com> wrote:
>
> As a mitigation for BHI, clear_bhb_loop() executes branches that overwrite
> the Branch History Buffer (BHB). On Alder Lake and newer parts this
> sequence is not sufficient because it doesn't clear enough entries. This
> was not an issue because these CPUs have a hardware control (BHI_DIS_S)
> that mitigates BHI in kernel.
>
> BHI variant of VMSCAPE requires isolating branch history between guests and
> userspace. Note that there is no equivalent hardware control for userspace.
> To effectively isolate branch history on newer CPUs, clear_bhb_loop()
> should execute a sufficient number of branches to clear a larger BHB.
>
> Dynamically set the loop count of clear_bhb_loop() such that it is
> effective on newer CPUs too. Use the hardware control enumeration
> X86_FEATURE_BHI_CTRL to select the appropriate loop count.
I didn't speak up earlier, because I have always considered the change
in MAXPHYADDR from ICX to SPR a hard barrier for virtual machines
masquerading as a different platform. Sadly, I am now losing that
battle. :(
If a heterogeneous migration pool includes hosts with and without
BHI_CTRL, then BHI_CTRL cannot be advertised to a guest, because it is
not possible to emulate BHI_DIS_S on a host that doesn't have it.
Hence, one cannot derive the size of the BHB from the existence of
this feature bit.
I think we need an explicit CPUID bit that a hypervisor can set to
indicate that the underlying hardware might be SPR or later.
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH v4 04/11] x86/bhi: Make clear_bhb_loop() effective on newer CPUs
2026-03-06 21:00 ` Jim Mattson
@ 2026-03-06 22:32 ` Pawan Gupta
2026-03-06 22:57 ` Jim Mattson
0 siblings, 1 reply; 63+ messages in thread
From: Pawan Gupta @ 2026-03-06 22:32 UTC (permalink / raw)
To: Jim Mattson
Cc: x86, David Kaplan, Nikolay Borisov, H. Peter Anvin,
Josh Poimboeuf, Sean Christopherson, Paolo Bonzini,
Borislav Petkov, Dave Hansen, linux-kernel, kvm, Asit Mallick,
Tao Zhang, David Dunn
On Fri, Mar 06, 2026 at 01:00:15PM -0800, Jim Mattson wrote:
> On Wed, Nov 19, 2025 at 10:19 PM Pawan Gupta
> <pawan.kumar.gupta@linux.intel.com> wrote:
> >
> > As a mitigation for BHI, clear_bhb_loop() executes branches that overwrite
> > the Branch History Buffer (BHB). On Alder Lake and newer parts this
> > sequence is not sufficient because it doesn't clear enough entries. This
> > was not an issue because these CPUs have a hardware control (BHI_DIS_S)
> > that mitigates BHI in kernel.
> >
> > BHI variant of VMSCAPE requires isolating branch history between guests and
> > userspace. Note that there is no equivalent hardware control for userspace.
> > To effectively isolate branch history on newer CPUs, clear_bhb_loop()
> > should execute a sufficient number of branches to clear a larger BHB.
> >
> > Dynamically set the loop count of clear_bhb_loop() such that it is
> > effective on newer CPUs too. Use the hardware control enumeration
> > X86_FEATURE_BHI_CTRL to select the appropriate loop count.
>
> I didn't speak up earlier, because I have always considered the change
> in MAXPHYADDR from ICX to SPR a hard barrier for virtual machines
> masquerading as a different platform. Sadly, I am now losing that
> battle. :(
>
> If a heterogeneous migration pool includes hosts with and without
> BHI_CTRL, then BHI_CTRL cannot be advertised to a guest, because it is
> not possible to emulate BHI_DIS_S on a host that doesn't have it.
> Hence, one cannot derive the size of the BHB from the existence of
> this feature bit.
As far as the VMSCAPE mitigation is concerned, it is done by the host,
so enumeration of BHI_CTRL is not a problem. The issue that you are
referring to exists with or without this patch.
I suppose your point is in the context of Native BHI mitigation for the
guests.
> I think we need an explicit CPUID bit that a hypervisor can set to
> indicate that the underlying hardware might be SPR or later.
Something similar was attempted via virtual-MSRs in the below series:
[RFC PATCH v3 09/10] KVM: VMX: Advertise MITI_CTRL_BHB_CLEAR_SEQ_S_SUPPORT
https://lore.kernel.org/lkml/20240410143446.797262-10-chao.gao@intel.com/
Do you think a rework of this approach would help?
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH v4 04/11] x86/bhi: Make clear_bhb_loop() effective on newer CPUs
2026-03-06 22:32 ` Pawan Gupta
@ 2026-03-06 22:57 ` Jim Mattson
2026-03-06 23:29 ` Pawan Gupta
0 siblings, 1 reply; 63+ messages in thread
From: Jim Mattson @ 2026-03-06 22:57 UTC (permalink / raw)
To: Pawan Gupta
Cc: x86, David Kaplan, Nikolay Borisov, H. Peter Anvin,
Josh Poimboeuf, Sean Christopherson, Paolo Bonzini,
Borislav Petkov, Dave Hansen, linux-kernel, kvm, Asit Mallick,
Tao Zhang, David Dunn
On Fri, Mar 6, 2026 at 2:32 PM Pawan Gupta
<pawan.kumar.gupta@linux.intel.com> wrote:
>
> On Fri, Mar 06, 2026 at 01:00:15PM -0800, Jim Mattson wrote:
> > On Wed, Nov 19, 2025 at 10:19 PM Pawan Gupta
> > <pawan.kumar.gupta@linux.intel.com> wrote:
> > >
> > > As a mitigation for BHI, clear_bhb_loop() executes branches that overwrite
> > > the Branch History Buffer (BHB). On Alder Lake and newer parts this
> > > sequence is not sufficient because it doesn't clear enough entries. This
> > > was not an issue because these CPUs have a hardware control (BHI_DIS_S)
> > > that mitigates BHI in kernel.
> > >
> > > BHI variant of VMSCAPE requires isolating branch history between guests and
> > > userspace. Note that there is no equivalent hardware control for userspace.
> > > To effectively isolate branch history on newer CPUs, clear_bhb_loop()
> > > should execute a sufficient number of branches to clear a larger BHB.
> > >
> > > Dynamically set the loop count of clear_bhb_loop() such that it is
> > > effective on newer CPUs too. Use the hardware control enumeration
> > > X86_FEATURE_BHI_CTRL to select the appropriate loop count.
> >
> > I didn't speak up earlier, because I have always considered the change
> > in MAXPHYADDR from ICX to SPR a hard barrier for virtual machines
> > masquerading as a different platform. Sadly, I am now losing that
> > battle. :(
> >
> > If a heterogeneous migration pool includes hosts with and without
> > BHI_CTRL, then BHI_CTRL cannot be advertised to a guest, because it is
> > not possible to emulate BHI_DIS_S on a host that doesn't have it.
> > Hence, one cannot derive the size of the BHB from the existence of
> > this feature bit.
>
> As far as VMSCAPE mitigation is concerned, mitigation is done by the host
> so enumeration of BHI_CTRL is not a problem. The issue that you are
> referring to exists with or without this patch.
The hypervisor *should* set IA32_SPEC_CTRL.BHI_DIS_S on the guest's
behalf when BHI_CTRL is not advertised to the guest. However, this
doesn't actually happen today. KVM does not support the tertiary
processor-based VM-execution controls bit 7 (virtualize
IA32_SPEC_CTRL), and KVM cedes the IA32_SPEC_CTRL MSR to the guest on
the first non-zero write.
> I suppose your point is in the context of Native BHI mitigation for the
> guests.
Specific vulnerabilities aside, my point is that one cannot infer
anything about the underlying hardware from the presence or absence of
BHI_CTRL in a VM.
> > I think we need an explicit CPUID bit that a hypervisor can set to
> > indicate that the underlying hardware might be SPR or later.
>
> Something similar was attempted via virtual-MSRs in the below series:
>
> [RFC PATCH v3 09/10] KVM: VMX: Advertise MITI_CTRL_BHB_CLEAR_SEQ_S_SUPPORT
> https://lore.kernel.org/lkml/20240410143446.797262-10-chao.gao@intel.com/
>
> Do you think a rework of this approach would help?
No, I think that whole idea is ill-conceived. As I said above, the
hypervisor should just set IA32_SPEC_CTRL.BHI_DIS_S on the guest's
behalf when BHI_CTRL is not advertised to the guest. I don't see any
value in predicating this mitigation on guest usage of the short BHB
clearing sequence. Just do it.
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH v4 04/11] x86/bhi: Make clear_bhb_loop() effective on newer CPUs
2026-03-06 22:57 ` Jim Mattson
@ 2026-03-06 23:29 ` Pawan Gupta
2026-03-07 0:35 ` Jim Mattson
0 siblings, 1 reply; 63+ messages in thread
From: Pawan Gupta @ 2026-03-06 23:29 UTC (permalink / raw)
To: Jim Mattson
Cc: x86, David Kaplan, Nikolay Borisov, H. Peter Anvin,
Josh Poimboeuf, Sean Christopherson, Paolo Bonzini,
Borislav Petkov, Dave Hansen, linux-kernel, kvm, Asit Mallick,
Tao Zhang, David Dunn
On Fri, Mar 06, 2026 at 02:57:13PM -0800, Jim Mattson wrote:
> On Fri, Mar 6, 2026 at 2:32 PM Pawan Gupta
> <pawan.kumar.gupta@linux.intel.com> wrote:
> >
> > On Fri, Mar 06, 2026 at 01:00:15PM -0800, Jim Mattson wrote:
> > > On Wed, Nov 19, 2025 at 10:19 PM Pawan Gupta
> > > <pawan.kumar.gupta@linux.intel.com> wrote:
> > > >
> > > > As a mitigation for BHI, clear_bhb_loop() executes branches that overwrite
> > > > the Branch History Buffer (BHB). On Alder Lake and newer parts this
> > > > sequence is not sufficient because it doesn't clear enough entries. This
> > > > was not an issue because these CPUs have a hardware control (BHI_DIS_S)
> > > > that mitigates BHI in kernel.
> > > >
> > > > BHI variant of VMSCAPE requires isolating branch history between guests and
> > > > userspace. Note that there is no equivalent hardware control for userspace.
> > > > To effectively isolate branch history on newer CPUs, clear_bhb_loop()
> > > > should execute a sufficient number of branches to clear a larger BHB.
> > > >
> > > > Dynamically set the loop count of clear_bhb_loop() such that it is
> > > > effective on newer CPUs too. Use the hardware control enumeration
> > > > X86_FEATURE_BHI_CTRL to select the appropriate loop count.
> > >
> > > I didn't speak up earlier, because I have always considered the change
> > > in MAXPHYADDR from ICX to SPR a hard barrier for virtual machines
> > > masquerading as a different platform. Sadly, I am now losing that
> > > battle. :(
> > >
> > > If a heterogeneous migration pool includes hosts with and without
> > > BHI_CTRL, then BHI_CTRL cannot be advertised to a guest, because it is
> > > not possible to emulate BHI_DIS_S on a host that doesn't have it.
> > > Hence, one cannot derive the size of the BHB from the existence of
> > > this feature bit.
> >
> > As far as VMSCAPE mitigation is concerned, mitigation is done by the host
> > so enumeration of BHI_CTRL is not a problem. The issue that you are
> > refering to exists with or without this patch.
>
> The hypervisor *should* set IA32_SPEC_CTRL.BHI_DIS_S on the guest's
> behalf when BHI_CTRL is not advertised to the guest. However, this
> doesn't actually happen today. KVM does not support the tertiary
> processor-based VM-execution controls bit 7 (virtualize
> IA32_SPEC_CTRL), and KVM cedes the IA32_SPEC_CTRL MSR to the guest on
> the first non-zero write.
The first half of the series adds support for virtualizing
IA32_SPEC_CTRL. At least that part is worth reconsidering.
https://lore.kernel.org/lkml/20240410143446.797262-1-chao.gao@intel.com/
> > I suppose your point is in the context of Native BHI mitigation for the
> > guests.
>
> Specific vulnerabilities aside, my point is that one cannot infer
> anything about the underlying hardware from the presence or absence of
> BHI_CTRL in a VM.
Agree.
> > > I think we need an explicit CPUID bit that a hypervisor can set to
> > > indicate that the underlying hardware might be SPR or later.
> >
> > Something similar was attempted via virtual-MSRs in the below series:
> >
> > [RFC PATCH v3 09/10] KVM: VMX: Advertise MITI_CTRL_BHB_CLEAR_SEQ_S_SUPPORT
> > https://lore.kernel.org/lkml/20240410143446.797262-10-chao.gao@intel.com/
> >
> > Do you think a rework of this approach would help?
>
> No, I think that whole idea is ill-conceived. As I said above, the
> hypervisor should just set IA32_SPEC_CTRL.BHI_DIS_S on the guest's
> behalf when BHI_CTRL is not advertised to the guest. I don't see any
> value in predicating this mitigation on guest usage of the short BHB
> clearing sequence. Just do it.
There are cases where this would be detrimental:
1. A guest disabling the mitigation in favor of performance.
2. A guest deploying the long SW sequence would suffer from two mitigations
for the same vulnerability.
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH v4 04/11] x86/bhi: Make clear_bhb_loop() effective on newer CPUs
2026-03-06 23:29 ` Pawan Gupta
@ 2026-03-07 0:35 ` Jim Mattson
2026-03-07 1:00 ` Pawan Gupta
0 siblings, 1 reply; 63+ messages in thread
From: Jim Mattson @ 2026-03-07 0:35 UTC (permalink / raw)
To: Pawan Gupta
Cc: x86, David Kaplan, Nikolay Borisov, H. Peter Anvin,
Josh Poimboeuf, Sean Christopherson, Paolo Bonzini,
Borislav Petkov, Dave Hansen, linux-kernel, kvm, Asit Mallick,
Tao Zhang, David Dunn
On Fri, Mar 6, 2026 at 3:29 PM Pawan Gupta
<pawan.kumar.gupta@linux.intel.com> wrote:
>
> On Fri, Mar 06, 2026 at 02:57:13PM -0800, Jim Mattson wrote:
> > On Fri, Mar 6, 2026 at 2:32 PM Pawan Gupta
> > <pawan.kumar.gupta@linux.intel.com> wrote:
> > >
> > > On Fri, Mar 06, 2026 at 01:00:15PM -0800, Jim Mattson wrote:
> > > > On Wed, Nov 19, 2025 at 10:19 PM Pawan Gupta
> > > > <pawan.kumar.gupta@linux.intel.com> wrote:
> > > > >
> > > > > As a mitigation for BHI, clear_bhb_loop() executes branches that overwrite
> > > > > the Branch History Buffer (BHB). On Alder Lake and newer parts this
> > > > > sequence is not sufficient because it doesn't clear enough entries. This
> > > > > was not an issue because these CPUs have a hardware control (BHI_DIS_S)
> > > > > that mitigates BHI in kernel.
> > > > >
> > > > > The BHI variant of VMSCAPE requires isolating branch history between
> > > > > guests and userspace. Note that there is no equivalent hardware control
> > > > > for userspace. To effectively isolate branch history on newer CPUs,
> > > > > clear_bhb_loop() should execute a sufficient number of branches to clear
> > > > > a larger BHB.
> > > > >
> > > > > Dynamically set the loop count of clear_bhb_loop() such that it is
> > > > > effective on newer CPUs too. Use the hardware control enumeration
> > > > > X86_FEATURE_BHI_CTRL to select the appropriate loop count.
> > > >
> > > > I didn't speak up earlier, because I have always considered the change
> > > > in MAXPHYADDR from ICX to SPR a hard barrier for virtual machines
> > > > masquerading as a different platform. Sadly, I am now losing that
> > > > battle. :(
> > > >
> > > > If a heterogeneous migration pool includes hosts with and without
> > > > BHI_CTRL, then BHI_CTRL cannot be advertised to a guest, because it is
> > > > not possible to emulate BHI_DIS_S on a host that doesn't have it.
> > > > Hence, one cannot derive the size of the BHB from the existence of
> > > > this feature bit.
> > >
> > > As far as VMSCAPE mitigation is concerned, mitigation is done by the host
> > > so enumeration of BHI_CTRL is not a problem. The issue that you are
> > > referring to exists with or without this patch.
> >
> > The hypervisor *should* set IA32_SPEC_CTRL.BHI_DIS_S on the guest's
> > behalf when BHI_CTRL is not advertised to the guest. However, this
> > doesn't actually happen today. KVM does not support the tertiary
> > processor-based VM-execution controls bit 7 (virtualize
> > IA32_SPEC_CTRL), and KVM cedes the IA32_SPEC_CTRL MSR to the guest on
> > the first non-zero write.
>
> The first half of the series adds support for virtualizing
> IA32_SPEC_CTRL. At least that part is worth reconsidering.
>
> https://lore.kernel.org/lkml/20240410143446.797262-1-chao.gao@intel.com/
Yes, the support for virtualizing SPEC_CTRL should be submitted separately.
> > > I suppose your point is in the context of Native BHI mitigation for the
> > > guests.
> >
> > Specific vulnerabilities aside, my point is that one cannot infer
> > anything about the underlying hardware from the presence or absence of
> > BHI_CTRL in a VM.
>
> Agree.
>
> > > > I think we need an explicit CPUID bit that a hypervisor can set to
> > > > indicate that the underlying hardware might be SPR or later.
> > >
> > > Something similar was attempted via virtual-MSRs in the below series:
> > >
> > > [RFC PATCH v3 09/10] KVM: VMX: Advertise MITI_CTRL_BHB_CLEAR_SEQ_S_SUPPORT
> > > https://lore.kernel.org/lkml/20240410143446.797262-10-chao.gao@intel.com/
> > >
> > > Do you think a rework of this approach would help?
> >
> > No, I think that whole idea is ill-conceived. As I said above, the
> > hypervisor should just set IA32_SPEC_CTRL.BHI_DIS_S on the guest's
> > behalf when BHI_CTRL is not advertised to the guest. I don't see any
> > value in predicating this mitigation on guest usage of the short BHB
> > clearing sequence. Just do it.
>
> There are cases where this would be detrimental:
>
> 1. A guest disabling the mitigation in favor of performance.
> 2. A guest deploying the long SW sequence would suffer from two mitigations
> for the same vulnerability.
The guest is already getting a performance boost from the newer
microarchitecture, so I think this argument is moot.
* Re: [PATCH v4 04/11] x86/bhi: Make clear_bhb_loop() effective on newer CPUs
2026-03-07 0:35 ` Jim Mattson
@ 2026-03-07 1:00 ` Pawan Gupta
2026-03-07 1:10 ` Jim Mattson
0 siblings, 1 reply; 63+ messages in thread
From: Pawan Gupta @ 2026-03-07 1:00 UTC (permalink / raw)
To: Jim Mattson
Cc: x86, David Kaplan, Nikolay Borisov, H. Peter Anvin,
Josh Poimboeuf, Sean Christopherson, Paolo Bonzini,
Borislav Petkov, Dave Hansen, linux-kernel, kvm, Asit Mallick,
Tao Zhang, David Dunn, chao.gao
+Chao
On Fri, Mar 06, 2026 at 04:35:49PM -0800, Jim Mattson wrote:
> > > > > I think we need an explicit CPUID bit that a hypervisor can set to
> > > > > indicate that the underlying hardware might be SPR or later.
> > > >
> > > > Something similar was attempted via virtual-MSRs in the below series:
> > > >
> > > > [RFC PATCH v3 09/10] KVM: VMX: Advertise MITI_CTRL_BHB_CLEAR_SEQ_S_SUPPORT
> > > > https://lore.kernel.org/lkml/20240410143446.797262-10-chao.gao@intel.com/
> > > >
> > > > Do you think a rework of this approach would help?
> > >
> > > No, I think that whole idea is ill-conceived. As I said above, the
> > > hypervisor should just set IA32_SPEC_CTRL.BHI_DIS_S on the guest's
> > > behalf when BHI_CTRL is not advertised to the guest. I don't see any
> > > value in predicating this mitigation on guest usage of the short BHB
> > > clearing sequence. Just do it.
> >
> > There are cases where this would be detrimental:
> >
> > 1. A guest disabling the mitigation in favor of performance.
> > 2. A guest deploying the long SW sequence would suffer from two mitigations
> > for the same vulnerability.
>
> The guest is already getting a performance boost from the newer
> microarchitecture, so I think this argument is moot.
For a Linux guest this is mostly true. IIRC, there is at least one major
non-Linux OS that suffers heavily from BHI_DIS_S.
If the enforcement is controlled by the userspace VMM, it is definitely
worth enabling KVM to mitigate on behalf of guests.
* Re: [PATCH v4 04/11] x86/bhi: Make clear_bhb_loop() effective on newer CPUs
2026-03-07 1:00 ` Pawan Gupta
@ 2026-03-07 1:10 ` Jim Mattson
2026-03-07 2:41 ` Pawan Gupta
0 siblings, 1 reply; 63+ messages in thread
From: Jim Mattson @ 2026-03-07 1:10 UTC (permalink / raw)
To: Pawan Gupta
Cc: x86, David Kaplan, Nikolay Borisov, H. Peter Anvin,
Josh Poimboeuf, Sean Christopherson, Paolo Bonzini,
Borislav Petkov, Dave Hansen, linux-kernel, kvm, Asit Mallick,
Tao Zhang, David Dunn, chao.gao
On Fri, Mar 6, 2026 at 5:01 PM Pawan Gupta
<pawan.kumar.gupta@linux.intel.com> wrote:
>
> +Chao
>
> On Fri, Mar 06, 2026 at 04:35:49PM -0800, Jim Mattson wrote:
> > > > > > I think we need an explicit CPUID bit that a hypervisor can set to
> > > > > > indicate that the underlying hardware might be SPR or later.
> > > > >
> > > > > Something similar was attempted via virtual-MSRs in the below series:
> > > > >
> > > > > [RFC PATCH v3 09/10] KVM: VMX: Advertise MITI_CTRL_BHB_CLEAR_SEQ_S_SUPPORT
> > > > > https://lore.kernel.org/lkml/20240410143446.797262-10-chao.gao@intel.com/
> > > > >
> > > > > Do you think a rework of this approach would help?
> > > >
> > > > No, I think that whole idea is ill-conceived. As I said above, the
> > > > hypervisor should just set IA32_SPEC_CTRL.BHI_DIS_S on the guest's
> > > > behalf when BHI_CTRL is not advertised to the guest. I don't see any
> > > > value in predicating this mitigation on guest usage of the short BHB
> > > > clearing sequence. Just do it.
> > >
> > > There are cases where this would be detrimental:
> > >
> > > 1. A guest disabling the mitigation in favor of performance.
> > > 2. A guest deploying the long SW sequence would suffer from two mitigations
> > > for the same vulnerability.
> >
> > The guest is already getting a performance boost from the newer
> > microarchitecture, so I think this argument is moot.
>
> For a Linux guest this is mostly true. IIRC, there is at least one major
> non-Linux OS that suffers heavily from BHI_DIS_S.
Presumably, this guest OS wants to deploy the long sequence (if it may
run on SPR and later) and doesn't want BHI_DIS_S foisted on it. I
don't recall that negotiation being possible with
MSR_VIRTUAL_MITIGATION_CTRL.
* Re: [PATCH v4 04/11] x86/bhi: Make clear_bhb_loop() effective on newer CPUs
2026-03-07 1:10 ` Jim Mattson
@ 2026-03-07 2:41 ` Pawan Gupta
2026-03-07 5:05 ` Jim Mattson
0 siblings, 1 reply; 63+ messages in thread
From: Pawan Gupta @ 2026-03-07 2:41 UTC (permalink / raw)
To: Jim Mattson
Cc: x86, David Kaplan, Nikolay Borisov, H. Peter Anvin,
Josh Poimboeuf, Sean Christopherson, Paolo Bonzini,
Borislav Petkov, Dave Hansen, linux-kernel, kvm, Asit Mallick,
Tao Zhang, David Dunn, chao.gao
On Fri, Mar 06, 2026 at 05:10:23PM -0800, Jim Mattson wrote:
> On Fri, Mar 6, 2026 at 5:01 PM Pawan Gupta
> <pawan.kumar.gupta@linux.intel.com> wrote:
> >
> > +Chao
> >
> > On Fri, Mar 06, 2026 at 04:35:49PM -0800, Jim Mattson wrote:
> > > > > > > I think we need an explicit CPUID bit that a hypervisor can set to
> > > > > > > indicate that the underlying hardware might be SPR or later.
> > > > > >
> > > > > > Something similar was attempted via virtual-MSRs in the below series:
> > > > > >
> > > > > > [RFC PATCH v3 09/10] KVM: VMX: Advertise MITI_CTRL_BHB_CLEAR_SEQ_S_SUPPORT
> > > > > > https://lore.kernel.org/lkml/20240410143446.797262-10-chao.gao@intel.com/
> > > > > >
> > > > > > Do you think a rework of this approach would help?
> > > > >
> > > > > No, I think that whole idea is ill-conceived. As I said above, the
> > > > > hypervisor should just set IA32_SPEC_CTRL.BHI_DIS_S on the guest's
> > > > > behalf when BHI_CTRL is not advertised to the guest. I don't see any
> > > > > value in predicating this mitigation on guest usage of the short BHB
> > > > > clearing sequence. Just do it.
> > > >
> > > > There are cases where this would be detrimental:
> > > >
> > > > 1. A guest disabling the mitigation in favor of performance.
> > > > 2. A guest deploying the long SW sequence would suffer from two mitigations
> > > > for the same vulnerability.
> > >
> > > The guest is already getting a performance boost from the newer
> > > microarchitecture, so I think this argument is moot.
> >
> > For a Linux guest this is mostly true. IIRC, there is at least one major
> > non-Linux OS that suffers heavily from BHI_DIS_S.
>
> Presumably, this guest OS wants to deploy the long sequence (if it may
> run on SPR and later) and doesn't want BHI_DIS_S foisted on it. I
> don't recall that negotiation being possible with
> MSR_VIRTUAL_MITIGATION_CTRL.
Patch 4/10 of that series is about BHI_DIS_S negotiation. A guest had to
set MITI_CTRL_BHB_CLEAR_SEQ_S_USED to indicate that it isn't aware of the
BHI_DIS_S control and is using the short sequence (yes, there is nothing
about the long sequence). When KVM sees this bit set, it deploys BHI_DIS_S
for that guest.
x86/bugs: Use Virtual MSRs to request BHI_DIS_S
https://lore.kernel.org/lkml/20240410143446.797262-5-chao.gao@intel.com/
* Re: [PATCH v4 04/11] x86/bhi: Make clear_bhb_loop() effective on newer CPUs
2026-03-07 2:41 ` Pawan Gupta
@ 2026-03-07 5:05 ` Jim Mattson
2026-03-09 22:29 ` Pawan Gupta
0 siblings, 1 reply; 63+ messages in thread
From: Jim Mattson @ 2026-03-07 5:05 UTC (permalink / raw)
To: Pawan Gupta
Cc: x86, David Kaplan, Nikolay Borisov, H. Peter Anvin,
Josh Poimboeuf, Sean Christopherson, Paolo Bonzini,
Borislav Petkov, Dave Hansen, linux-kernel, kvm, Asit Mallick,
Tao Zhang, David Dunn, chao.gao
On Fri, Mar 6, 2026 at 6:41 PM Pawan Gupta
<pawan.kumar.gupta@linux.intel.com> wrote:
>
> On Fri, Mar 06, 2026 at 05:10:23PM -0800, Jim Mattson wrote:
> > On Fri, Mar 6, 2026 at 5:01 PM Pawan Gupta
> > <pawan.kumar.gupta@linux.intel.com> wrote:
> > >
> > > +Chao
> > >
> > > On Fri, Mar 06, 2026 at 04:35:49PM -0800, Jim Mattson wrote:
> > > > > > > > I think we need an explicit CPUID bit that a hypervisor can set to
> > > > > > > > indicate that the underlying hardware might be SPR or later.
> > > > > > >
> > > > > > > Something similar was attempted via virtual-MSRs in the below series:
> > > > > > >
> > > > > > > [RFC PATCH v3 09/10] KVM: VMX: Advertise MITI_CTRL_BHB_CLEAR_SEQ_S_SUPPORT
> > > > > > > https://lore.kernel.org/lkml/20240410143446.797262-10-chao.gao@intel.com/
> > > > > > >
> > > > > > > Do you think a rework of this approach would help?
> > > > > >
> > > > > > No, I think that whole idea is ill-conceived. As I said above, the
> > > > > > hypervisor should just set IA32_SPEC_CTRL.BHI_DIS_S on the guest's
> > > > > > behalf when BHI_CTRL is not advertised to the guest. I don't see any
> > > > > > value in predicating this mitigation on guest usage of the short BHB
> > > > > > clearing sequence. Just do it.
> > > > >
> > > > > There are cases where this would be detrimental:
> > > > >
> > > > > 1. A guest disabling the mitigation in favor of performance.
> > > > > 2. A guest deploying the long SW sequence would suffer from two mitigations
> > > > > for the same vulnerability.
> > > >
> > > > The guest is already getting a performance boost from the newer
> > > > microarchitecture, so I think this argument is moot.
> > >
> > > For a Linux guest this is mostly true. IIRC, there is at least one major
> > > non-Linux OS that suffers heavily from BHI_DIS_S.
> >
> > Presumably, this guest OS wants to deploy the long sequence (if it may
> > run on SPR and later) and doesn't want BHI_DIS_S foisted on it. I
> > don't recall that negotiation being possible with
> > MSR_VIRTUAL_MITIGATION_CTRL.
>
> Patch 4/10 of that series is about BHI_DIS_S negotiation. A guest had to
> set MITI_CTRL_BHB_CLEAR_SEQ_S_USED to indicate that it isn't aware of the
> BHI_DIS_S control and is using the short sequence (yes, there is nothing
> about the long sequence). When KVM sees this bit set, it deploys BHI_DIS_S
> for that guest.
>
> x86/bugs: Use Virtual MSRs to request BHI_DIS_S
> https://lore.kernel.org/lkml/20240410143446.797262-5-chao.gao@intel.com/
Ah. I see now. I missed this part of the specification: "Guest OSes
that are using long or TSX sequences can optionally clear
BHB_CLEAR_SEQ_S_USED bit in order to communicate this to the VMM."
Maybe this would be less confusing if BHB_CLEAR_SEQ_S_USED were named
more clearly. Perhaps something like "SET_BHI_DIS_S_FOR_ME"?
Is it reasonable to assume that without the presence of BHI_CTRL, the
non-Linux OS we've been discussing will (ironically) only use the long
sequence if the hypervisor advertises BHB_CLEAR_SEQ_S_SUPPORT? That
is, without BHB_CLEAR_SEQ_S_SUPPORT, does it assume the short sequence
is adequate?
* Re: [PATCH v4 04/11] x86/bhi: Make clear_bhb_loop() effective on newer CPUs
2026-03-07 5:05 ` Jim Mattson
@ 2026-03-09 22:29 ` Pawan Gupta
2026-03-09 23:05 ` Jim Mattson
0 siblings, 1 reply; 63+ messages in thread
From: Pawan Gupta @ 2026-03-09 22:29 UTC (permalink / raw)
To: Jim Mattson
Cc: x86, David Kaplan, Nikolay Borisov, H. Peter Anvin,
Josh Poimboeuf, Sean Christopherson, Paolo Bonzini,
Borislav Petkov, Dave Hansen, linux-kernel, kvm, Asit Mallick,
Tao Zhang, David Dunn, chao.gao
On Fri, Mar 06, 2026 at 09:05:01PM -0800, Jim Mattson wrote:
> On Fri, Mar 6, 2026 at 6:41 PM Pawan Gupta
> <pawan.kumar.gupta@linux.intel.com> wrote:
> >
> > On Fri, Mar 06, 2026 at 05:10:23PM -0800, Jim Mattson wrote:
> > > On Fri, Mar 6, 2026 at 5:01 PM Pawan Gupta
> > > <pawan.kumar.gupta@linux.intel.com> wrote:
> > > >
> > > > +Chao
> > > >
> > > > On Fri, Mar 06, 2026 at 04:35:49PM -0800, Jim Mattson wrote:
> > > > > > > > > I think we need an explicit CPUID bit that a hypervisor can set to
> > > > > > > > > indicate that the underlying hardware might be SPR or later.
> > > > > > > >
> > > > > > > > Something similar was attempted via virtual-MSRs in the below series:
> > > > > > > >
> > > > > > > > [RFC PATCH v3 09/10] KVM: VMX: Advertise MITI_CTRL_BHB_CLEAR_SEQ_S_SUPPORT
> > > > > > > > https://lore.kernel.org/lkml/20240410143446.797262-10-chao.gao@intel.com/
> > > > > > > >
> > > > > > > > Do you think a rework of this approach would help?
> > > > > > >
> > > > > > > No, I think that whole idea is ill-conceived. As I said above, the
> > > > > > > hypervisor should just set IA32_SPEC_CTRL.BHI_DIS_S on the guest's
> > > > > > > behalf when BHI_CTRL is not advertised to the guest. I don't see any
> > > > > > > value in predicating this mitigation on guest usage of the short BHB
> > > > > > > clearing sequence. Just do it.
> > > > > >
> > > > > > There are cases where this would be detrimental:
> > > > > >
> > > > > > 1. A guest disabling the mitigation in favor of performance.
> > > > > > 2. A guest deploying the long SW sequence would suffer from two mitigations
> > > > > > for the same vulnerability.
> > > > >
> > > > > The guest is already getting a performance boost from the newer
> > > > > microarchitecture, so I think this argument is moot.
> > > >
> > > > For a Linux guest this is mostly true. IIRC, there is at least one major
> > > > non-Linux OS that suffers heavily from BHI_DIS_S.
> > >
> > > Presumably, this guest OS wants to deploy the long sequence (if it may
> > > run on SPR and later) and doesn't want BHI_DIS_S foisted on it. I
> > > don't recall that negotiation being possible with
> > > MSR_VIRTUAL_MITIGATION_CTRL.
> >
> > Patch 4/10 of that series is about BHI_DIS_S negotiation. A guest had to
> > set MITI_CTRL_BHB_CLEAR_SEQ_S_USED to indicate that it isn't aware of the
> > BHI_DIS_S control and is using the short sequence (yes, there is nothing
> > about the long sequence). When KVM sees this bit set, it deploys BHI_DIS_S
> > for that guest.
> >
> > x86/bugs: Use Virtual MSRs to request BHI_DIS_S
> > https://lore.kernel.org/lkml/20240410143446.797262-5-chao.gao@intel.com/
>
> Ah. I see now. I missed this part of the specification: "Guest OSes
> that are using long or TSX sequences can optionally clear
> BHB_CLEAR_SEQ_S_USED bit in order to communicate this to the VMM."
>
> Maybe this would be less confusing if BHB_CLEAR_SEQ_S_USED were named
> more clearly. Perhaps something like "SET_BHI_DIS_S_FOR_ME"?
Yes, that would have been clearer.
> Is it reasonable to assume that without the presence of BHI_CTRL, the
> non-Linux OS we've been discussing will (ironically) only use the long
> sequence if the hypervisor advertises BHB_CLEAR_SEQ_S_SUPPORT? That
> is, without BHB_CLEAR_SEQ_S_SUPPORT, does it assume the short sequence
> is adequate?
I don't know. But it doesn't seem logical to assume the short sequence is
adequate when the guest can't ensure that the VMM would set BHI_DIS_S for
it. It should be the other way around.
* Re: [PATCH v4 04/11] x86/bhi: Make clear_bhb_loop() effective on newer CPUs
2026-03-09 22:29 ` Pawan Gupta
@ 2026-03-09 23:05 ` Jim Mattson
2026-03-10 0:00 ` Pawan Gupta
0 siblings, 1 reply; 63+ messages in thread
From: Jim Mattson @ 2026-03-09 23:05 UTC (permalink / raw)
To: Pawan Gupta
Cc: x86, David Kaplan, Nikolay Borisov, H. Peter Anvin,
Josh Poimboeuf, Sean Christopherson, Paolo Bonzini,
Borislav Petkov, Dave Hansen, linux-kernel, kvm, Asit Mallick,
Tao Zhang, David Dunn, chao.gao
On Mon, Mar 9, 2026 at 3:29 PM Pawan Gupta
<pawan.kumar.gupta@linux.intel.com> wrote:
> > Is it reasonable to assume that without the presence of BHI_CTRL, the
> > non-Linux OS we've been discussing will (ironically) only use the long
> > sequence if the hypervisor advertises BHB_CLEAR_SEQ_S_SUPPORT? That
> > is, without BHB_CLEAR_SEQ_S_SUPPORT, does it assume the short sequence
> > is adequate?
>
> I don't know. But it doesn't seem logical to assume the short sequence is
> adequate when the guest can't ensure that the VMM would set BHI_DIS_S for
> it. It should be the other way around.
Assuming BHI_NO is clear...
If the hypervisor offers to enable BHI_DIS_S for you, then the
migration pool may contain SPR+, so you need the long sequence if
you're going to clear the BHB in software rather than accept the
hypervisor's offer.
You are saying that if the hypervisor does not offer to enable
BHI_DIS_S for you, then you know nothing, so you need the long
sequence.
How would a guest that refuses BHI_DIS_S ever be able to use the short sequence?
* Re: [PATCH v4 04/11] x86/bhi: Make clear_bhb_loop() effective on newer CPUs
2026-03-09 23:05 ` Jim Mattson
@ 2026-03-10 0:00 ` Pawan Gupta
2026-03-10 0:08 ` Jim Mattson
0 siblings, 1 reply; 63+ messages in thread
From: Pawan Gupta @ 2026-03-10 0:00 UTC (permalink / raw)
To: Jim Mattson
Cc: x86, David Kaplan, Nikolay Borisov, H. Peter Anvin,
Josh Poimboeuf, Sean Christopherson, Paolo Bonzini,
Borislav Petkov, Dave Hansen, linux-kernel, kvm, Asit Mallick,
Tao Zhang, David Dunn, chao.gao
On Mon, Mar 09, 2026 at 04:05:55PM -0700, Jim Mattson wrote:
> On Mon, Mar 9, 2026 at 3:29 PM Pawan Gupta
> <pawan.kumar.gupta@linux.intel.com> wrote:
> > > Is it reasonable to assume that without the presence of BHI_CTRL, the
> > > non-Linux OS we've been discussing will (ironically) only use the long
> > > sequence if the hypervisor advertises BHB_CLEAR_SEQ_S_SUPPORT? That
> > > is, without BHB_CLEAR_SEQ_S_SUPPORT, does it assume the short sequence
> > > is adequate?
> >
> > I don't know. But it doesn't seem logical to assume the short sequence is
> > adequate when the guest can't ensure that the VMM would set BHI_DIS_S for
> > it. It should be the other way around.
>
> Assuming BHI_NO is clear...
>
> If the hypervisor offers to enable BHI_DIS_S for you, then the
> migration pool may contain SPR+, so you need the long sequence if
> you're going to clear the BHB in software rather than accept the
> hypervisor's offer.
> You are saying that if the hypervisor does not offer to enable
> BHI_DIS_S for you, then you know nothing, so you need the long
> sequence.
This is all when MSR_VIRTUAL_MITIGATION_* is enumerated by the VMM.
> How would a guest that refuses BHI_DIS_S ever be able to use the short sequence?
Without those MSRs a guest could ignore the migration case, and use the
short sequence.
* Re: [PATCH v4 04/11] x86/bhi: Make clear_bhb_loop() effective on newer CPUs
2026-03-10 0:00 ` Pawan Gupta
@ 2026-03-10 0:08 ` Jim Mattson
2026-03-10 0:52 ` Pawan Gupta
0 siblings, 1 reply; 63+ messages in thread
From: Jim Mattson @ 2026-03-10 0:08 UTC (permalink / raw)
To: Pawan Gupta
Cc: x86, David Kaplan, Nikolay Borisov, H. Peter Anvin,
Josh Poimboeuf, Sean Christopherson, Paolo Bonzini,
Borislav Petkov, Dave Hansen, linux-kernel, kvm, Asit Mallick,
Tao Zhang, David Dunn, chao.gao
On Mon, Mar 9, 2026 at 5:00 PM Pawan Gupta
<pawan.kumar.gupta@linux.intel.com> wrote:
>
> On Mon, Mar 09, 2026 at 04:05:55PM -0700, Jim Mattson wrote:
> > On Mon, Mar 9, 2026 at 3:29 PM Pawan Gupta
> > <pawan.kumar.gupta@linux.intel.com> wrote:
> > > > Is it reasonable to assume that without the presence of BHI_CTRL, the
> > > > non-Linux OS we've been discussing will (ironically) only use the long
> > > > sequence if the hypervisor advertises BHB_CLEAR_SEQ_S_SUPPORT? That
> > > > is, without BHB_CLEAR_SEQ_S_SUPPORT, does it assume the short sequence
> > > > is adequate?
> > >
> > > I don't know. But it doesn't seem logical to assume the short sequence is
> > > adequate when the guest can't ensure that the VMM would set BHI_DIS_S for
> > > it. It should be the other way around.
> >
> > Assuming BHI_NO is clear...
> >
> > If the hypervisor offers to enable BHI_DIS_S for you, then the
> > migration pool may contain SPR+, so you need the long sequence if
> > you're going to clear the BHB in software rather than accept the
> > hypervisor's offer.
> > You are saying that if the hypervisor does not offer to enable
> > BHI_DIS_S for you, then you know nothing, so you need the long
> > sequence.
>
> This is all when MSR_VIRTUAL_MITIGATION_* is enumerated by the VMM.
>
> > How would a guest that refuses BHI_DIS_S ever be able to use the short sequence?
>
> Without those MSRs a guest could ignore the migration case, and use the
> short sequence.
Ah. So, a hypervisor only advertises MSR_VIRTUAL_MITIGATION_* when the
migration pool spans the change in BHB size?
Why not just declare that explicitly? This whole mechanism seems to be
an exercise in obfuscation.
* Re: [PATCH v4 04/11] x86/bhi: Make clear_bhb_loop() effective on newer CPUs
2026-03-10 0:08 ` Jim Mattson
@ 2026-03-10 0:52 ` Pawan Gupta
0 siblings, 0 replies; 63+ messages in thread
From: Pawan Gupta @ 2026-03-10 0:52 UTC (permalink / raw)
To: Jim Mattson
Cc: x86, David Kaplan, Nikolay Borisov, H. Peter Anvin,
Josh Poimboeuf, Sean Christopherson, Paolo Bonzini,
Borislav Petkov, Dave Hansen, linux-kernel, kvm, Asit Mallick,
Tao Zhang, David Dunn, chao.gao
On Mon, Mar 09, 2026 at 05:08:09PM -0700, Jim Mattson wrote:
> On Mon, Mar 9, 2026 at 5:00 PM Pawan Gupta
> <pawan.kumar.gupta@linux.intel.com> wrote:
> >
> > On Mon, Mar 09, 2026 at 04:05:55PM -0700, Jim Mattson wrote:
> > > On Mon, Mar 9, 2026 at 3:29 PM Pawan Gupta
> > > <pawan.kumar.gupta@linux.intel.com> wrote:
> > > > > Is it reasonable to assume that without the presence of BHI_CTRL, the
> > > > > non-Linux OS we've been discussing will (ironically) only use the long
> > > > > sequence if the hypervisor advertises BHB_CLEAR_SEQ_S_SUPPORT? That
> > > > > is, without BHB_CLEAR_SEQ_S_SUPPORT, does it assume the short sequence
> > > > > is adequate?
> > > >
> > > > I don't know. But it doesn't seem logical to assume the short sequence is
> > > > adequate when the guest can't ensure that the VMM would set BHI_DIS_S for
> > > > it. It should be the other way around.
> > >
> > > Assuming BHI_NO is clear...
> > >
> > > If the hypervisor offers to enable BHI_DIS_S for you, then the
> > > migration pool may contain SPR+, so you need the long sequence if
> > > you're going to clear the BHB in software rather than accept the
> > > hypervisor's offer.
> > > You are saying that if the hypervisor does not offer to enable
> > > BHI_DIS_S for you, then you know nothing, so you need the long
> > > sequence.
> >
> > This is all when MSR_VIRTUAL_MITIGATION_* is enumerated by the VMM.
> >
> > > How would a guest that refuses BHI_DIS_S ever be able to use the short sequence?
> >
> > Without those MSRs a guest could ignore the migration case, and use the
> > short sequence.
>
> Ah. So, a hypervisor only advertises MSR_VIRTUAL_MITIGATION_* when the
> migration pool spans the change in BHB size?
Not really, the virtual MSRs can be exposed to guests regardless of
whether they are in a migration pool.
> Why not just declare that explicitly? This whole mechanism seems to be
> an exercise in obfuscation.
What I meant was that without the virtual MSRs, a guest has no way to know
whether it could be in a migration pool that spans the change in BHB size.
It has to assume one way or the other.
When the virtual MSRs are enumerated, we can handle the migration cases
better. A guest would set/clear BHB_CLEAR_SEQ_S_USED based on whether it
deploys the long or the short sequence. And when the guest migrates, the
VMM would update the guest's mitigation under the hood based on:
if guest BHB_CLEAR_SEQ_S_USED == 1:
        if host.CPUID.BHI_CTRL supported:
                Set guest.virtual.SPEC_CTRL[BHI_DIS_S]
        else:
                Do nothing, short sequence is sufficient
else:
        Do nothing, guest indicated that it is not using short sequence
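The same VMM-side handling, as a purely illustrative Python sketch (the
names are made up, this is not KVM code):

```python
def vmm_on_migrate(guest_seq_s_used, host_has_bhi_ctrl):
    """Illustrative VMM-side mitigation update on guest migration.

    guest_seq_s_used:  guest set MITI_CTRL_BHB_CLEAR_SEQ_S_USED, i.e. it
                       relies on the short BHB-clearing sequence
    host_has_bhi_ctrl: destination host enumerates CPUID BHI_CTRL
    """
    if guest_seq_s_used:
        if host_has_bhi_ctrl:
            # Short sequence is insufficient on this host; set BHI_DIS_S
            # in the guest's virtualized SPEC_CTRL on its behalf.
            return "set_virtual_spec_ctrl_bhi_dis_s"
        # Older host: the short sequence is sufficient as-is.
        return "nothing"
    # Guest indicated it is not using the short sequence (long or TSX
    # sequence); honor that and do not force BHI_DIS_S on it.
    return "nothing"
```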
^ permalink raw reply [flat|nested] 63+ messages in thread
end of thread, other threads:[~2026-03-10 0:53 UTC | newest]
Thread overview: 63+ messages
2025-11-20 6:17 [PATCH v4 00/11] VMSCAPE optimization for BHI variant Pawan Gupta
2025-11-20 6:17 ` [PATCH v4 01/11] x86/bhi: x86/vmscape: Move LFENCE out of clear_bhb_loop() Pawan Gupta
2025-11-20 16:15 ` Nikolay Borisov
2025-11-20 16:56 ` Pawan Gupta
2025-11-20 16:58 ` Nikolay Borisov
2025-11-20 6:18 ` [PATCH v4 02/11] x86/bhi: Move the BHB sequence to a macro for reuse Pawan Gupta
2025-11-20 16:28 ` Nikolay Borisov
2025-11-20 16:57 ` Pawan Gupta
2025-11-25 0:21 ` Pawan Gupta
2025-11-20 6:18 ` [PATCH v4 03/11] x86/bhi: Make the depth of BHB-clearing configurable Pawan Gupta
2025-11-20 17:02 ` Nikolay Borisov
2025-11-20 6:18 ` [PATCH v4 04/11] x86/bhi: Make clear_bhb_loop() effective on newer CPUs Pawan Gupta
2025-11-21 12:33 ` Nikolay Borisov
2025-11-21 16:40 ` Dave Hansen
2025-11-21 16:45 ` Nikolay Borisov
2025-11-21 16:50 ` Dave Hansen
2025-11-21 18:16 ` Pawan Gupta
2025-11-21 18:42 ` Dave Hansen
2025-11-21 21:26 ` Pawan Gupta
2025-11-21 21:36 ` Dave Hansen
2025-11-24 19:21 ` Pawan Gupta
2025-11-22 11:05 ` david laight
2025-11-24 19:31 ` Pawan Gupta
2025-11-25 11:34 ` david laight
2025-12-04 1:40 ` Pawan Gupta
2025-12-04 9:15 ` david laight
2025-12-04 21:56 ` Pawan Gupta
2025-12-05 9:21 ` david laight
2025-11-26 19:23 ` Pawan Gupta
2026-03-06 21:00 ` Jim Mattson
2026-03-06 22:32 ` Pawan Gupta
2026-03-06 22:57 ` Jim Mattson
2026-03-06 23:29 ` Pawan Gupta
2026-03-07 0:35 ` Jim Mattson
2026-03-07 1:00 ` Pawan Gupta
2026-03-07 1:10 ` Jim Mattson
2026-03-07 2:41 ` Pawan Gupta
2026-03-07 5:05 ` Jim Mattson
2026-03-09 22:29 ` Pawan Gupta
2026-03-09 23:05 ` Jim Mattson
2026-03-10 0:00 ` Pawan Gupta
2026-03-10 0:08 ` Jim Mattson
2026-03-10 0:52 ` Pawan Gupta
2025-11-20 6:18 ` [PATCH v4 05/11] x86/vmscape: Rename x86_ibpb_exit_to_user to x86_predictor_flush_exit_to_user Pawan Gupta
2025-11-20 6:19 ` [PATCH v4 06/11] x86/vmscape: Move mitigation selection to a switch() Pawan Gupta
2025-11-21 14:27 ` Nikolay Borisov
2025-11-24 23:09 ` Pawan Gupta
2025-11-25 10:19 ` Nikolay Borisov
2025-11-25 17:45 ` Pawan Gupta
2025-11-20 6:19 ` [PATCH v4 07/11] x86/vmscape: Use write_ibpb() instead of indirect_branch_prediction_barrier() Pawan Gupta
2025-11-21 12:59 ` Nikolay Borisov
2025-11-20 6:19 ` [PATCH v4 08/11] x86/vmscape: Use static_call() for predictor flush Pawan Gupta
2025-11-20 6:19 ` [PATCH v4 09/11] x86/vmscape: Deploy BHB clearing mitigation Pawan Gupta
2025-11-21 14:18 ` Nikolay Borisov
2025-11-21 18:29 ` Pawan Gupta
2025-11-21 14:23 ` Nikolay Borisov
2025-11-21 18:41 ` Pawan Gupta
2025-11-21 18:53 ` Nikolay Borisov
2025-11-21 21:29 ` Pawan Gupta
2025-11-20 6:20 ` [PATCH v4 10/11] x86/vmscape: Override conflicting attack-vector controls with =force Pawan Gupta
2025-11-21 18:04 ` Nikolay Borisov
2025-11-20 6:20 ` [PATCH v4 11/11] x86/vmscape: Add cmdline vmscape=on to override attack vector controls Pawan Gupta
2025-11-25 11:41 ` Nikolay Borisov
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox