* [MODERATED] Encrypted Message
2018-05-02 21:51 [patch V11 00/16] SSB 0 Thomas Gleixner
@ 2018-05-03 4:27 ` Tim Chen
0 siblings, 0 replies; 72+ messages in thread
From: Tim Chen @ 2018-05-03 4:27 UTC (permalink / raw)
To: speck
From: Tim Chen <tim.c.chen@linux.intel.com>
To: speck for Thomas Gleixner <speck@linutronix.de>
Subject: Re: [patch V11 00/16] SSB 0
On 05/02/2018 02:51 PM, speck for Thomas Gleixner wrote:
> Changes since V10:
>
> - Addressed Ingo's review feedback
>
> - Picked up Reviewed-bys
>
> Delta patch below. Bundle is coming in separate mail. Git repo branches are
> updated as well. The master branch contains also the fix for the lost IBRS
> issue Tim was seeing.
>
> If there are no further issues and nitpicks, I'm going to make the
> changes immutable and changes need to go incremental on top.
>
> Thanks,
>
> tglx
>
>
I notice that this code ignores the current process's TIF_RDS setting
in the prctl case:
#define firmware_restrict_branch_speculation_end()			\
do {									\
	u64 val = x86_get_default_spec_ctrl();				\
									\
	alternative_msr_write(MSR_IA32_SPEC_CTRL, val,			\
			      X86_FEATURE_USE_IBRS_FW);			\
	preempt_enable();						\
} while (0)
x86_get_default_spec_ctrl() will return x86_spec_ctrl_base, which
will result in x86_spec_ctrl_base being written to the MSR
in the prctl case on Intel CPUs. That incorrectly ignores the current
process's TIF_RDS setting, so the RDS bit will not be set.
Instead, the following value should be written to the MSR
on Intel CPUs:
x86_spec_ctrl_base | rds_tif_to_spec_ctrl(current_thread_info()->flags)
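The intended computation can be modeled with a small self-contained sketch. The bit positions below are assumed placeholders, not the kernel's real definitions (those live in asm/msr-index.h and asm/thread_info.h); only the OR-ing of the base value with the task's flag translation is the point:

```c
#include <stdint.h>

/* Assumed placeholder constants -- illustrative only. */
#define SPEC_CTRL_RDS	(1ULL << 2)	/* RDS/SSBD bit in IA32_SPEC_CTRL */
#define TIF_RDS_BIT	5		/* hypothetical thread-flag position */
#define TIF_RDS		(1UL << TIF_RDS_BIT)

/* Mirrors the role of rds_tif_to_spec_ctrl(): translate the task's
 * TIF_RDS flag into the corresponding SPEC_CTRL bit. */
static inline uint64_t rds_tif_to_spec_ctrl(unsigned long tif_flags)
{
	return (tif_flags & TIF_RDS) ? SPEC_CTRL_RDS : 0;
}

/* The value that should reach MSR_IA32_SPEC_CTRL in the prctl case:
 * the base value OR'ed with the current task's RDS request, so a
 * task that asked for RDS via prctl actually gets the bit set. */
static inline uint64_t spec_ctrl_for_task(uint64_t spec_ctrl_base,
					  unsigned long tif_flags)
{
	return spec_ctrl_base | rds_tif_to_spec_ctrl(tif_flags);
}
```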
Thanks.
Tim
* [MODERATED] Encrypted Message
2018-05-18 14:29 ` Thomas Gleixner
@ 2018-05-18 19:50 ` Tim Chen
0 siblings, 0 replies; 72+ messages in thread
From: Tim Chen @ 2018-05-18 19:50 UTC (permalink / raw)
To: speck
From: Tim Chen <tim.c.chen@linux.intel.com>
To: speck for Thomas Gleixner <speck@linutronix.de>
Subject: Re: Is: Sleep states? Was: Re: SSB status - V18 pushed out
On 05/18/2018 07:29 AM, speck for Thomas Gleixner wrote:
> On Fri, 18 May 2018, speck for Konrad Rzeszutek Wilk wrote:
>> On Thu, May 17, 2018 at 10:53:28PM +0200, speck for Thomas Gleixner wrote:
>>> Folks,
>>>
>>> we finally reached a stable state with the SSB patches. I've updated all 3
>>> branches master/linux-4.16.y/linux-4.14.y in the repo and attached the
>>> resulting git bundles. They merge cleanly on top of the current HEADs of
>>> the relevant trees.
>>>
>>> The lot survived light testing on my side and it would be great if everyone
>>> involved could expose it to their test scenarios.
>>>
>>> Thanks to everyone who participated in that effort (patches, review,
>>> testing ...)!
>>
>> Yeey! Thank you.
>>
>> I was reading the updated Intel doc today (instead of skim reading it) and it mentioned:
>>
>> "Intel recommends that the SSBD MSR bit be cleared when in a sleep state on such processors."
>
> Well, the same recommendation was for IBRS and the reason is that with HT
> enabled the other hyperthread will not be able to go full speed because the
> sleeping one vanished with IBRS set. SSBD works the same way.
>
> " SW should clear [SSBD] when enter sleep state, just as is suggested for
> IBRS and STIBP on existing implementations"
>
> and that document says:
>
> "Enabling IBRS on one logical processor of a core with Intel
> Hyper-Threading Technology may affect branch prediction on other logical
> processors of the same core. For this reason, software should disable IBRS
> (by clearing IA32_SPEC_CTRL.IBRS) prior to entering a sleep state (e.g.,
> by executing HLT or MWAIT) and re-enable IBRS upon wakeup and prior to
> executing any indirect branch."
>
> So it's only a performance issue and not a fundamental problem to have it
> on when executing HLT/MWAIT.
>
> So we have two situations here:
>
> 1) ssbd = on, i.e X86_FEATURE_SPEC_STORE_BYPASS_DISABLE
>
> There it is irrelevant because both threads have SSBD set permanently,
> so unsetting it on HLT/MWAIT is not going to lift the restriction for
> the running sibling thread. And HLT/MWAIT is not going to be faster by
> unsetting it and then setting it on wakeup again....
>
> 2) SSBD via prctl/seccomp
>
> Nothing to do there, because idle task does not have TIF_SSBD set so it
> never goes with SSBD set into HLT/MWAIT.
>
> So I think we're good, but it would be nice if Intel folks would confirm
> that.
Yes, we had thought about turning off SSBD in the mwait path earlier,
but decided it was unnecessary for the exact reasons Thomas mentioned.
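Thomas's case 2 can be made concrete with a minimal model. The constants are assumed placeholders (the kernel's real values differ); the helper mirrors the role of the SSBD flag translation:

```c
#include <stdint.h>

/* Assumed placeholder constants -- illustrative only. */
#define SPEC_CTRL_SSBD	(1ULL << 2)	/* SSBD bit in IA32_SPEC_CTRL */
#define TIF_SSBD	(1UL << 5)	/* hypothetical thread-flag position */

static inline uint64_t ssbd_tif_to_spec_ctrl(unsigned long tif)
{
	return (tif & TIF_SSBD) ? SPEC_CTRL_SSBD : 0;
}

/* SPEC_CTRL value a task enters HLT/MWAIT with in the prctl/seccomp
 * modes: the base value plus the task's own TIF_SSBD request.  The
 * idle task never has TIF_SSBD set, so its value never carries SSBD
 * and there is nothing to clear before sleeping. */
static inline uint64_t spec_ctrl_at_idle_entry(uint64_t base,
					       unsigned long idle_tif)
{
	return base | ssbd_tif_to_spec_ctrl(idle_tif);
}
```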
Thanks.
Tim
* [MODERATED] Encrypted Message
2018-05-24 15:33 ` Thomas Gleixner
@ 2018-05-24 23:18 ` Tim Chen
2018-05-25 18:22 ` Tim Chen
2018-05-26 19:14 ` L1D-Fault KVM mitigation Thomas Gleixner
0 siblings, 2 replies; 72+ messages in thread
From: Tim Chen @ 2018-05-24 23:18 UTC (permalink / raw)
To: speck
From: Tim Chen <tim.c.chen@linux.intel.com>
To: speck for Thomas Gleixner <speck@linutronix.de>
Subject: Re: L1D-Fault KVM mitigation
On 05/24/2018 08:33 AM, speck for Thomas Gleixner wrote:
> On Thu, 24 May 2018, speck for Thomas Gleixner wrote:
>> On Thu, 24 May 2018, speck for Peter Zijlstra wrote:
>>> On Wed, May 23, 2018 at 10:45:45AM +0100, speck for David Woodhouse wrote:
>>>> The microcode trick just makes it a lot easier because we don't
>>>> have to *explicitly* pause the sibling vCPUs and manage their state on
>>>> every vmexit/entry. And avoids potential race conditions with managing
>>>> that in software.
>>>
>>> Yes, it would certainly help and avoid a fair bit of ugly. It would, for
>>> instance, avoid having to modify irq_enter() / irq_exit(), which would
>>> otherwise be required (and possibly leak all data touched up until that
>>> point is reached).
>>>
>>> But even with all that, adding L1-flush to every VMENTER will hurt lots.
>>> Consider for example the PIO emulation used when booting a guest from a
>>> disk image. That causes VMEXIT/VMENTER at stupendous rates.
>>
>> Just did a test on SKL Client where I have ucode. It does not have HT so
>> it's not suffering from any HT side effects when L1D is flushed.
>>
>> Boot time from a disk image is ~1s measured from the first vcpu enter.
>>
>> With L1D Flush on vmenter the boot time is about 5-10% slower. And that has
>> lots of PIO operations in the early boot.
>>
>> For a kernel build the L1D Flush has an overhead of < 1%.
>>
>> Netperf guest to host has a slight drop of the throughput in the 2%
>> range. Host to guest surprisingly goes up by ~3%. Fun stuff!
>>
>> Now I isolated two host CPUs and pinned the two vCPUs on it to be able to
>> measure the overhead. Running cyclictest with a period of 25us in the guest
>> on a isolated guest CPU and monitoring the behaviour with perf on the host
>> for the corresponding host CPU gives
>>
>> No Flush Flush
>>
>> 1.31 insn per cycle 1.14 insn per cycle
>>
>> 2e6 L1-dcache-load-misses/sec 26e6 L1-dcache-load-misses/sec
>>
>> In that simple test the L1D misses go up by a factor of 13.
>>
>> Now with the whole gang scheduling the numbers I heard through the
>> grapevine are in the range of factor 130, i.e. 13k% for a simple boot from
>> disk image. 13 minutes instead of 6 seconds...
The performance is highly dependent on how often we VM exit.
Working with Peter Z on his prototype, the performance ranges from
no regression for a network loopback, through ~20% regression for a
kernel compile, to ~100% regression on file I/O. PIO brings out the
worst aspect of the synchronization overhead, as we VM exit on every
dword PIO read; the kernel and initrd images were about 50 MB for the
experiment, which led to 13 min of load time.

We may need to do the co-scheduling only when the VM exit rate is low,
and turn off SMT when the VM exit rate becomes too high.

(Note: I haven't added the L1 flush on VM entry in my experiment; that
is on the todo list.)
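The rate-based switching idea could be sketched roughly as follows. The thresholds are entirely made up, and Thomas pushes back on runtime SMT toggling later in the thread, so treat this as a strawman:

```c
#include <stdint.h>

/* Hypothetical thresholds -- the tuning numbers are invented. */
#define EXIT_RATE_HIGH	50000	/* exits/sec: co-scheduling too costly */
#define EXIT_RATE_LOW	 5000	/* exits/sec: co-scheduling is cheap   */

enum smt_policy { SMT_COSCHEDULE, SMT_OFF };

/* Pick a policy from a smoothed VM-exit rate.  The current policy is
 * passed in so the two thresholds give hysteresis: in the band between
 * them we keep doing whatever we were doing instead of flip-flopping. */
static enum smt_policy pick_smt_policy(uint64_t exits_per_sec,
				       enum smt_policy cur)
{
	if (exits_per_sec > EXIT_RATE_HIGH)
		return SMT_OFF;
	if (exits_per_sec < EXIT_RATE_LOW)
		return SMT_COSCHEDULE;
	return cur;
}
```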
Tim
>>
>> That's not surprising at all, though the magnitude is way higher than I
>> expected. I don't see a realistic chance for vmexit heavy workloads to work
>> with that synchronization thing at all, whether it's ucode assisted or not.
>
> That said, I think we should stage the host side mitigations plus the L1
> flush on vmenter ASAP so we are not standing there with our pants down when
> the cat comes out of the bag early. That means HT off, but it's still
> better than having absolutely nothing.
>
> The gang scheduling nonsense can be added on top when it should
> surprisingly turn out to be usable at all.
>
> Thanks,
>
> tglx
>
* [MODERATED] Encrypted Message
2018-05-24 23:18 ` [MODERATED] Encrypted Message Tim Chen
@ 2018-05-25 18:22 ` Tim Chen
2018-05-26 19:14 ` L1D-Fault KVM mitigation Thomas Gleixner
1 sibling, 0 replies; 72+ messages in thread
From: Tim Chen @ 2018-05-25 18:22 UTC (permalink / raw)
To: speck
From: Tim Chen <tim.c.chen@linux.intel.com>
To: speck for Tim Chen <speck@linutronix.de>
Subject: Re: L1D-Fault KVM mitigation
On 05/24/2018 04:18 PM, speck for Tim Chen wrote:
> On 05/24/2018 08:33 AM, speck for Thomas Gleixner wrote:
>> On Thu, 24 May 2018, speck for Thomas Gleixner wrote:
>>> On Thu, 24 May 2018, speck for Peter Zijlstra wrote:
>>>> On Wed, May 23, 2018 at 10:45:45AM +0100, speck for David Woodhouse wrote:
>>>>> The microcode trick just makes it a lot easier because we don't
>>>>> have to *explicitly* pause the sibling vCPUs and manage their state on
>>>>> every vmexit/entry. And avoids potential race conditions with managing
>>>>> that in software.
>>>>
>>>> Yes, it would certainly help and avoid a fair bit of ugly. It would, for
>>>> instance, avoid having to modify irq_enter() / irq_exit(), which would
>>>> otherwise be required (and possibly leak all data touched up until that
>>>> point is reached).
>>>>
>>>> But even with all that, adding L1-flush to every VMENTER will hurt lots.
>>>> Consider for example the PIO emulation used when booting a guest from a
>>>> disk image. That causes VMEXIT/VMENTER at stupendous rates.
>>>
>>> Just did a test on SKL Client where I have ucode. It does not have HT so
>>> it's not suffering from any HT side effects when L1D is flushed.
>>>
>>> Boot time from a disk image is ~1s measured from the first vcpu enter.
>>>
>>> With L1D Flush on vmenter the boot time is about 5-10% slower. And that has
>>> lots of PIO operations in the early boot.
>>>
>>> For a kernel build the L1D Flush has an overhead of < 1%.
>>>
>>> Netperf guest to host has a slight drop of the throughput in the 2%
>>> range. Host to guest surprisingly goes up by ~3%. Fun stuff!
>>>
>>> Now I isolated two host CPUs and pinned the two vCPUs on it to be able to
>>> measure the overhead. Running cyclictest with a period of 25us in the guest
>>> on a isolated guest CPU and monitoring the behaviour with perf on the host
>>> for the corresponding host CPU gives
>>>
>>> No Flush Flush
>>>
>>> 1.31 insn per cycle 1.14 insn per cycle
>>>
>>> 2e6 L1-dcache-load-misses/sec 26e6 L1-dcache-load-misses/sec
>>>
>>> In that simple test the L1D misses go up by a factor of 13.
>>>
>>> Now with the whole gang scheduling the numbers I heard through the
>>> grapevine are in the range of factor 130, i.e. 13k% for a simple boot from
>>> disk image. 13 minutes instead of 6 seconds...
>
> The performance is highly dependent on how often we VM exit.
> Working with Peter Z on his prototype, the performance ranges from
> no regression for a network loopback, through ~20% regression for a
> kernel compile, to ~100% regression on file I/O. PIO brings out the
> worst aspect of the synchronization overhead, as we VM exit on every
> dword PIO read; the kernel and initrd images were about 50 MB for the
> experiment, which led to 13 min of load time.
>
> We may need to do the co-scheduling only when the VM exit rate is low,
> and turn off SMT when the VM exit rate becomes too high.
>
> (Note: I haven't added the L1 flush on VM entry in my experiment; that
> is on the todo list.)
As a postscript, I added the L1 flush and the performance numbers stay
pretty much the same, so the synchronization overhead is dominant and
the L1 flush overhead is secondary.
Tim
* [MODERATED] Encrypted Message
2018-05-26 19:14 ` L1D-Fault KVM mitigation Thomas Gleixner
@ 2018-05-29 19:29 ` Tim Chen
2018-05-29 21:14 ` L1D-Fault KVM mitigation Thomas Gleixner
0 siblings, 1 reply; 72+ messages in thread
From: Tim Chen @ 2018-05-29 19:29 UTC (permalink / raw)
To: speck
From: Tim Chen <tim.c.chen@linux.intel.com>
To: speck for Thomas Gleixner <speck@linutronix.de>
Subject: Re: L1D-Fault KVM mitigation
On 05/26/2018 12:14 PM, speck for Thomas Gleixner wrote:
> On Thu, 24 May 2018, speck for Tim Chen wrote:
>
>> of load time.
>>
>> We may need to do the co-scheduling only when VM exit rate is low, and
>> turn off the SMT when VM exit rate becomes too high.
>
> You cannot do that during runtime. That will destroy placement schemes and
> whatever. The SMT off decision needs to be done at a quiescent moment,
> i.e. before starting VMs.
Taking SMT offline is a bit much and too big a hammer. Andi and I
thought about having the scheduler force the sibling thread into idle
instead for the high VM exit rate scenario; then we don't have to
bother synchronizing with the idle sibling. But that raises fairness
issues, as we would be starving the other run queue.
>
> Running the same compile single threaded (offlining vCPU1 in the guest)
> increases the time to 107 seconds.
>
> 107 / 88 = 1.22
>
> I.e. it's 20% slower than the one using two threads. That means that it is
> the same slowdown as having two threads synchronized (your number).
Yes, with the compile workload the HT speedup was mostly eaten up by
the synchronization overhead.
>
> So if I take the above example and assume that the overhead of
> synchronization is ~20% then the average vmenter/vmexit time is close to
> 50us.
>
>
> So I can see the usefulness for scenarios which David Woodhouse described
> where vCPU and host CPU have a fixed relationship and the guests exit once
> in a while. But that should really be done with ucode assistance which
> avoids all the nasty synchronization hackery more or less completely.
The ucode guys are looking into such possibilities. It is tough as they
have to work within the constraint of limited ucode headroom.
Thanks.
Tim
* [MODERATED] [PATCH 0/2] L1TF KVM 0
@ 2018-05-29 19:42 Paolo Bonzini
2018-05-29 19:42 ` [MODERATED] [PATCH 1/2] L1TF KVM 1 Paolo Bonzini
` (6 more replies)
0 siblings, 7 replies; 72+ messages in thread
From: Paolo Bonzini @ 2018-05-29 19:42 UTC (permalink / raw)
To: speck
Here is the first version of the L1 terminal fault KVM mitigation patches,
adding a TLB flush on vmentry.
Thanks,
Paolo
* [MODERATED] [PATCH 1/2] L1TF KVM 1
2018-05-29 19:42 [MODERATED] [PATCH 0/2] L1TF KVM 0 Paolo Bonzini
@ 2018-05-29 19:42 ` Paolo Bonzini
2018-05-29 19:42 ` [MODERATED] [PATCH 2/2] L1TF KVM 2 Paolo Bonzini
` (5 subsequent siblings)
6 siblings, 0 replies; 72+ messages in thread
From: Paolo Bonzini @ 2018-05-29 19:42 UTC (permalink / raw)
To: speck
This patch adds two mitigation modes for CVE-2018-3620, aka L1 terminal
fault. The two modes are "vmexit_l1d_flush=1" and "vmexit_l1d_flush=2".
"vmexit_l1d_flush=2" is simply doing an L1 cache flush on every vmexit.
"vmexit_l1d_flush=1" tries to avoid the flush on vmexits that are
"confined" in the kind of code that they execute. The idea is based
on Intel's patches, but the final list of "confined" vmexits isn't
quite the same.
Notably, L1 cache flushes are performed on EPT violations (which are
basically KVM-level page faults), vmexits that involve the emulator,
and on every KVM_RUN invocation (so each userspace exit). However,
most vmexits are considered safe. I singled out the emulator because
it may be a good target for other speculative execution-based threats,
and the MMU because it can bring host page tables in the L1 cache.
The mitigation does not in any way try to do anything about hyperthreading;
it is possible for a sibling thread to read data from the cache during a
vmexit, before the host completes the flush, or to read data from the cache
while a sibling runs. This part of the work is not ready yet.
For now I'm leaving the default at "vmexit_l1d_flush=2", in case we need
to push out the patches in an emergency embargo break, but I don't think
it's the best setting. The cost is up to 2.5x more expensive vmexits
on Haswell processors, and 30% on Coffee Lake (for the latter, this is
independent of whether microcode or the generic flush code is used).
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
arch/x86/include/asm/kvm_host.h | 7 ++++-
arch/x86/kvm/mmu.c | 1 +
arch/x86/kvm/svm.c | 3 +-
arch/x86/kvm/vmx.c | 62 ++++++++++++++++++++++++++++++++++++++++-
arch/x86/kvm/x86.c | 54 +++++++++++++++++++++++++++++++++--
5 files changed, 122 insertions(+), 5 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index c25775fad4ed..ae4bab8b1f8a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -711,6 +711,9 @@ struct kvm_vcpu_arch {
/* be preempted when it's in kernel-mode(cpl=0) */
bool preempted_in_kernel;
+
+ /* for L1 terminal fault vulnerability */
+ bool vcpu_unconfined;
};
struct kvm_lpage_info {
@@ -879,6 +882,7 @@ struct kvm_vcpu_stat {
u64 signal_exits;
u64 irq_window_exits;
u64 nmi_window_exits;
+ u64 l1d_flush;
u64 halt_exits;
u64 halt_successful_poll;
u64 halt_attempted_poll;
@@ -937,7 +941,7 @@ struct kvm_x86_ops {
void (*vcpu_free)(struct kvm_vcpu *vcpu);
void (*vcpu_reset)(struct kvm_vcpu *vcpu, bool init_event);
- void (*prepare_guest_switch)(struct kvm_vcpu *vcpu);
+ void (*prepare_guest_switch)(struct kvm_vcpu *vcpu, bool *need_l1d_flush);
void (*vcpu_load)(struct kvm_vcpu *vcpu, int cpu);
void (*vcpu_put)(struct kvm_vcpu *vcpu);
@@ -1446,6 +1450,7 @@ bool kvm_intr_is_single_vcpu(struct kvm *kvm, struct kvm_lapic_irq *irq,
void kvm_set_msi_irq(struct kvm *kvm, struct kvm_kernel_irq_routing_entry *e,
struct kvm_lapic_irq *irq);
+void kvm_l1d_flush(void);
static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu)
{
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 8494dbae41b9..3b1140b156b2 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3836,6 +3836,7 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
{
int r = 1;
+ vcpu->arch.vcpu_unconfined = true;
switch (vcpu->arch.apf.host_apf_reason) {
default:
trace_kvm_page_fault(fault_address, error_code);
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 1fc05e428aba..849edcd31aad 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -5404,8 +5404,9 @@ static void svm_flush_tlb(struct kvm_vcpu *vcpu, bool invalidate_gpa)
svm->asid_generation--;
}
-static void svm_prepare_guest_switch(struct kvm_vcpu *vcpu)
+static void svm_prepare_guest_switch(struct kvm_vcpu *vcpu, bool *need_l1d_flush)
{
+ *need_l1d_flush = false;
}
static inline void sync_cr8_to_lapic(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 3f1696570b41..b90ba122e73a 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -71,6 +71,9 @@
};
MODULE_DEVICE_TABLE(x86cpu, vmx_cpu_id);
+static int __read_mostly vmexit_l1d_flush = 2;
+module_param_named(vmexit_l1d_flush, vmexit_l1d_flush, int, 0644);
+
static bool __read_mostly enable_vpid = 1;
module_param_named(vpid, enable_vpid, bool, 0444);
@@ -938,6 +941,8 @@ static void __always_inline vmx_disable_intercept_for_msr(unsigned long *msr_bit
static DEFINE_PER_CPU(struct list_head, blocked_vcpu_on_cpu);
static DEFINE_PER_CPU(spinlock_t, blocked_vcpu_on_cpu_lock);
+static DEFINE_PER_CPU(int, last_vector);
+
enum {
VMX_VMREAD_BITMAP,
VMX_VMWRITE_BITMAP,
@@ -2423,6 +2428,59 @@ static void vmx_save_host_state(struct kvm_vcpu *vcpu)
vmx->guest_msrs[i].mask);
}
+static inline bool vmx_handling_confined(int reason)
+{
+ switch (reason) {
+ case EXIT_REASON_EXCEPTION_NMI:
+ case EXIT_REASON_HLT:
+ case EXIT_REASON_PAUSE_INSTRUCTION:
+ case EXIT_REASON_APIC_WRITE:
+ case EXIT_REASON_MSR_WRITE:
+ case EXIT_REASON_VMCALL:
+ case EXIT_REASON_CR_ACCESS:
+ case EXIT_REASON_DR_ACCESS:
+ case EXIT_REASON_CPUID:
+ case EXIT_REASON_PREEMPTION_TIMER:
+ case EXIT_REASON_MSR_READ:
+ case EXIT_REASON_EOI_INDUCED:
+ case EXIT_REASON_WBINVD:
+ case EXIT_REASON_XSETBV:
+ /*
+ * The next three set vcpu->arch.vcpu_unconfined themselves, so
+ * we consider them confined here.
+ */
+ case EXIT_REASON_EPT_VIOLATION:
+ case EXIT_REASON_EPT_MISCONFIG:
+ case EXIT_REASON_IO_INSTRUCTION:
+ return true;
+ case EXIT_REASON_EXTERNAL_INTERRUPT: {
+ int cpu = raw_smp_processor_id();
+ int vector = per_cpu(last_vector, cpu);
+ return vector == LOCAL_TIMER_VECTOR || vector == RESCHEDULE_VECTOR;
+ }
+ default:
+ return false;
+ }
+}
+
+static bool vmx_core_confined(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+ return vmx_handling_confined(vmx->exit_reason);
+}
+
+static void vmx_prepare_guest_switch(struct kvm_vcpu *vcpu, bool *need_l1d_flush)
+{
+ vmx_save_host_state(vcpu);
+ if (vmexit_l1d_flush == 0 || !enable_ept)
+ *need_l1d_flush = false;
+ else if (vmexit_l1d_flush == 1)
+ *need_l1d_flush |= !vmx_core_confined(vcpu);
+ else
+ *need_l1d_flush = true;
+}
+
static void __vmx_load_host_state(struct vcpu_vmx *vmx)
{
if (!vmx->host_state.loaded)
@@ -9457,11 +9515,13 @@ static void vmx_handle_external_intr(struct kvm_vcpu *vcpu)
unsigned long entry;
gate_desc *desc;
struct vcpu_vmx *vmx = to_vmx(vcpu);
+ int cpu = raw_smp_processor_id();
#ifdef CONFIG_X86_64
unsigned long tmp;
#endif
vector = exit_intr_info & INTR_INFO_VECTOR_MASK;
+ per_cpu(last_vector, cpu) = vector;
desc = (gate_desc *)vmx->host_idt_base + vector;
entry = gate_offset(desc);
asm volatile(
@@ -12642,7 +12702,7 @@ static int enable_smi_window(struct kvm_vcpu *vcpu)
.vcpu_free = vmx_free_vcpu,
.vcpu_reset = vmx_vcpu_reset,
- .prepare_guest_switch = vmx_save_host_state,
+ .prepare_guest_switch = vmx_prepare_guest_switch,
.vcpu_load = vmx_vcpu_load,
.vcpu_put = vmx_vcpu_put,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 59371de5d722..ada9e55fc871 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -194,6 +194,7 @@ struct kvm_stats_debugfs_item debugfs_entries[] = {
{ "irq_injections", VCPU_STAT(irq_injections) },
{ "nmi_injections", VCPU_STAT(nmi_injections) },
{ "req_event", VCPU_STAT(req_event) },
+ { "l1d_flush", VCPU_STAT(l1d_flush) },
{ "mmu_shadow_zapped", VM_STAT(mmu_shadow_zapped) },
{ "mmu_pte_write", VM_STAT(mmu_pte_write) },
{ "mmu_pte_updated", VM_STAT(mmu_pte_updated) },
@@ -6026,6 +6027,8 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu,
bool writeback = true;
bool write_fault_to_spt = vcpu->arch.write_fault_to_shadow_pgtable;
+ vcpu->arch.vcpu_unconfined = true;
+
/*
* Clear write_fault_to_shadow_pgtable here to ensure it is
* never reused.
@@ -6509,10 +6512,40 @@ static int pvclock_gtod_notify(struct notifier_block *nb, unsigned long unused,
};
#endif
+
+#define L1D_CACHE_ORDER 3
+static void *__read_mostly empty_zero_pages;
+
+void kvm_l1d_flush(void)
+{
+ asm volatile(
+ "movq %0, %%rax\n\t"
+ "leaq 65536(%0), %%rdx\n\t"
+ "11: \n\t"
+ "movzbl (%%rax), %%ecx\n\t"
+ "addq $4096, %%rax\n\t"
+ "cmpq %%rax, %%rdx\n\t"
+ "jne 11b\n\t"
+ "xorl %%eax, %%eax\n\t"
+ "cpuid\n\t"
+ "xorl %%eax, %%eax\n\t"
+ "12:\n\t"
+ "movzwl %%ax, %%edx\n\t"
+ "addl $64, %%eax\n\t"
+ "movzbl (%%rdx, %0), %%ecx\n\t"
+ "cmpl $65536, %%eax\n\t"
+ "jne 12b\n\t"
+ "lfence\n\t"
+ :
+ : "r" (empty_zero_pages)
+ : "rax", "rbx", "rcx", "rdx");
+}
+
int kvm_arch_init(void *opaque)
{
int r;
struct kvm_x86_ops *ops = opaque;
+ struct page *page;
if (kvm_x86_ops) {
printk(KERN_ERR "kvm: already loaded the other module\n");
@@ -6532,10 +6565,15 @@ int kvm_arch_init(void *opaque)
}
r = -ENOMEM;
+ page = alloc_pages(GFP_ATOMIC, L1D_CACHE_ORDER);
+ if (!page)
+ goto out;
+ empty_zero_pages = page_address(page);
+
shared_msrs = alloc_percpu(struct kvm_shared_msrs);
if (!shared_msrs) {
printk(KERN_ERR "kvm: failed to allocate percpu kvm_shared_msrs\n");
- goto out;
+ goto out_free_zero_pages;
}
r = kvm_mmu_module_init();
@@ -6566,6 +6604,8 @@ int kvm_arch_init(void *opaque)
return 0;
+out_free_zero_pages:
+ free_pages((unsigned long)empty_zero_pages, L1D_CACHE_ORDER);
out_free_percpu:
free_percpu(shared_msrs);
out:
@@ -6590,6 +6630,7 @@ void kvm_arch_exit(void)
#endif
kvm_x86_ops = NULL;
kvm_mmu_module_exit();
+ free_pages((unsigned long)empty_zero_pages, L1D_CACHE_ORDER);
free_percpu(shared_msrs);
}
@@ -7233,6 +7274,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
kvm_cpu_accept_dm_intr(vcpu);
bool req_immediate_exit = false;
+ bool need_l1d_flush;
if (kvm_request_pending(vcpu)) {
if (kvm_check_request(KVM_REQ_MMU_RELOAD, vcpu))
@@ -7372,7 +7414,13 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
preempt_disable();
- kvm_x86_ops->prepare_guest_switch(vcpu);
+ need_l1d_flush = vcpu->arch.vcpu_unconfined;
+ vcpu->arch.vcpu_unconfined = false;
+ kvm_x86_ops->prepare_guest_switch(vcpu, &need_l1d_flush);
+ if (need_l1d_flush) {
+ vcpu->stat.l1d_flush++;
+ kvm_l1d_flush();
+ }
/*
* Disable IRQs before setting IN_GUEST_MODE. Posted interrupt
@@ -7559,6 +7607,7 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
struct kvm *kvm = vcpu->kvm;
vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
+ vcpu->arch.vcpu_unconfined = true;
for (;;) {
if (kvm_vcpu_running(vcpu)) {
@@ -8675,6 +8724,7 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu)
void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu)
{
+ vcpu->arch.vcpu_unconfined = true;
kvm_x86_ops->sched_in(vcpu, cpu);
}
--
1.8.3.1
* [MODERATED] [PATCH 2/2] L1TF KVM 2
2018-05-29 19:42 [MODERATED] [PATCH 0/2] L1TF KVM 0 Paolo Bonzini
2018-05-29 19:42 ` [MODERATED] [PATCH 1/2] L1TF KVM 1 Paolo Bonzini
@ 2018-05-29 19:42 ` Paolo Bonzini
[not found] ` <20180529194240.7F1336110A@crypto-ml.lab.linutronix.de>
` (4 subsequent siblings)
6 siblings, 0 replies; 72+ messages in thread
From: Paolo Bonzini @ 2018-05-29 19:42 UTC (permalink / raw)
To: speck
Intel's new microcode adds a new feature bit in CPUID[7,0].EDX[28].
If it is active, the displacement flush is unnecessary. Tested on
a Coffee Lake machine.
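For reference, the presence of that enumeration bit can be probed from userspace with a short sketch using GCC's cpuid.h helper. The bit position (CPUID.(EAX=7,ECX=0):EDX[28]) is the one named above; everything else here is illustrative:

```c
#include <stdbool.h>
#include <cpuid.h>

/* Check CPUID.(EAX=7,ECX=0):EDX[28], the L1D_FLUSH enumeration bit
 * added by the new microcode.  Returns false if leaf 7 is absent. */
static bool cpu_has_l1d_flush_msr(void)
{
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
		return false;
	return (edx >> 28) & 1;
}
```

On machines without the updated microcode this simply reports false and the displacement flush remains the fallback.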
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/msr-index.h | 3 +++
arch/x86/kvm/x86.c | 4 ++++
3 files changed, 8 insertions(+)
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 578793e97431..aebf89c4175d 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -333,6 +333,7 @@
#define X86_FEATURE_PCONFIG (18*32+18) /* Intel PCONFIG */
#define X86_FEATURE_SPEC_CTRL (18*32+26) /* "" Speculation Control (IBRS + IBPB) */
#define X86_FEATURE_INTEL_STIBP (18*32+27) /* "" Single Thread Indirect Branch Predictors */
+#define X86_FEATURE_FLUSH_L1D (18*32+28) /* IA32_FLUSH_L1D MSR */
#define X86_FEATURE_ARCH_CAPABILITIES (18*32+29) /* IA32_ARCH_CAPABILITIES MSR (Intel) */
/*
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 53d5b1b9255e..f43bd9f23053 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -65,6 +65,9 @@
#define MSR_MTRRcap 0x000000fe
+#define MSR_IA32_FLUSH_L1D 0x10b
+#define MSR_IA32_FLUSH_L1D_VALUE 0x00000001
+
#define MSR_IA32_ARCH_CAPABILITIES 0x0000010a
#define ARCH_CAP_RDCL_NO (1 << 0) /* Not susceptible to Meltdown */
#define ARCH_CAP_IBRS_ALL (1 << 1) /* Enhanced IBRS support */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ada9e55fc871..43738283aa2a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6518,6 +6518,10 @@ static int pvclock_gtod_notify(struct notifier_block *nb, unsigned long unused,
void kvm_l1d_flush(void)
{
+ if (static_cpu_has(X86_FEATURE_FLUSH_L1D)) {
+ wrmsrl(MSR_IA32_FLUSH_L1D, MSR_IA32_FLUSH_L1D_VALUE);
+ return;
+ }
asm volatile(
"movq %0, %%rax\n\t"
"leaq 65536(%0), %%rdx\n\t"
--
1.8.3.1
* Re: [PATCH 1/2] L1TF KVM 1
[not found] ` <20180529194240.7F1336110A@crypto-ml.lab.linutronix.de>
@ 2018-05-29 22:49 ` Thomas Gleixner
2018-05-29 23:54 ` [MODERATED] " Andrew Cooper
` (3 more replies)
0 siblings, 4 replies; 72+ messages in thread
From: Thomas Gleixner @ 2018-05-29 22:49 UTC (permalink / raw)
To: speck
On Tue, 29 May 2018, speck for Paolo Bonzini wrote:
> From: Paolo Bonzini <pbonzini@redhat.com>
> Subject: [PATCH 1/2] kvm: x86: mitigation for L1 cache terminal fault vulnerabilities
>
> This patch adds two mitigation modes for CVE-2018-3620, aka L1 terminal
> fault. The two modes are "vmexit_l1d_flush=1" and "vmexit_l1d_flush=2".
This is confusing at best. Why is this vmexit_l1d_flush? You flush on
VMENTER not on VMEXIT.
What you are doing is to decide whether the last exit reason requires a
flush or not. But that decision happens on VMENTER.
> Notably, L1 cache flushes are performed on EPT violations (which are
> basically KVM-level page faults), vmexits that involve the emulator,
> and on every KVM_RUN invocation (so each userspace exit). However,
> most vmexits are considered safe. I singled out the emulator because
> it may be a good target for other speculative execution-based threats,
> and the MMU because it can bring host page tables in the L1 cache.
What about interrupts?
> @@ -2423,6 +2428,59 @@ static void vmx_save_host_state(struct kvm_vcpu *vcpu)
> vmx->guest_msrs[i].mask);
> }
>
> +static inline bool vmx_handling_confined(int reason)
> +{
> + switch (reason) {
> + case EXIT_REASON_EXCEPTION_NMI:
> + case EXIT_REASON_HLT:
> + case EXIT_REASON_PAUSE_INSTRUCTION:
> + case EXIT_REASON_APIC_WRITE:
> + case EXIT_REASON_MSR_WRITE:
> + case EXIT_REASON_VMCALL:
> + case EXIT_REASON_CR_ACCESS:
> + case EXIT_REASON_DR_ACCESS:
> + case EXIT_REASON_CPUID:
> + case EXIT_REASON_PREEMPTION_TIMER:
> + case EXIT_REASON_MSR_READ:
> + case EXIT_REASON_EOI_INDUCED:
> + case EXIT_REASON_WBINVD:
> + case EXIT_REASON_XSETBV:
> + /*
> + * The next three set vcpu->arch.vcpu_unconfined themselves, so
> + * we consider them confined here.
What's the logic behind that?
> + */
> + case EXIT_REASON_EPT_VIOLATION:
> + case EXIT_REASON_EPT_MISCONFIG:
> + case EXIT_REASON_IO_INSTRUCTION:
> + return true;
> + case EXIT_REASON_EXTERNAL_INTERRUPT: {
> + int cpu = raw_smp_processor_id();
> + int vector = per_cpu(last_vector, cpu);
> + return vector == LOCAL_TIMER_VECTOR || vector == RESCHEDULE_VECTOR;
That wants a comment why these two are considered safe.
The timer vector is a doubtful one. It does not necessarily cause a
reschedule and it can run arbitrary softirq code on interrupt return. I
wouldn't be that sure that it's safe.
> + }
> + default:
> + return false;
> + }
> +}
> +
> +static bool vmx_core_confined(struct kvm_vcpu *vcpu)
> +{
> + struct vcpu_vmx *vmx = to_vmx(vcpu);
> +
> + return vmx_handling_confined(vmx->exit_reason);
> +}
> +
> +static void vmx_prepare_guest_switch(struct kvm_vcpu *vcpu, bool *need_l1d_flush)
> +{
> + vmx_save_host_state(vcpu);
> + if (vmexit_l1d_flush == 0 || !enable_ept)
> + *need_l1d_flush = false;
> + else if (vmexit_l1d_flush == 1)
> + *need_l1d_flush |= !vmx_core_confined(vcpu);
This inverted logic does not make the code more readable. It's obfuscation
for no value.
> + else
> + *need_l1d_flush = true;
> +}
> @@ -9457,11 +9515,13 @@ static void vmx_handle_external_intr(struct kvm_vcpu *vcpu)
> unsigned long entry;
> gate_desc *desc;
> struct vcpu_vmx *vmx = to_vmx(vcpu);
> + int cpu = raw_smp_processor_id();
> #ifdef CONFIG_X86_64
> unsigned long tmp;
> #endif
>
> vector = exit_intr_info & INTR_INFO_VECTOR_MASK;
> + per_cpu(last_vector, cpu) = vector;
Why aren't you doing the evaluation of the vector right here and set the
unconfined bit instead of having yet another indirection and probably
another cache line for that per_cpu() storage? That does not make any
sense at all.
> desc = (gate_desc *)vmx->host_idt_base + vector;
> entry = gate_offset(desc);
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -6509,10 +6512,40 @@ static int pvclock_gtod_notify(struct notifier_block *nb, unsigned long unused,
> };
> #endif
>
> +
> +#define L1D_CACHE_ORDER 3
This should use the cache size information and not a hard coded value I think.
> +static void *__read_mostly empty_zero_pages;
> +
> +void kvm_l1d_flush(void)
> +{
> + asm volatile(
> + "movq %0, %%rax\n\t"
> + "leaq 65536(%0), %%rdx\n\t"
Why 64K?
> + "11: \n\t"
> + "movzbl (%%rax), %%ecx\n\t"
> + "addq $4096, %%rax\n\t"
> + "cmpq %%rax, %%rdx\n\t"
> + "jne 11b\n\t"
> + "xorl %%eax, %%eax\n\t"
> + "cpuid\n\t"
What's the cpuid invocation for?
> + "xorl %%eax, %%eax\n\t"
> + "12:\n\t"
> + "movzwl %%ax, %%edx\n\t"
> + "addl $64, %%eax\n\t"
> + "movzbl (%%rdx, %0), %%ecx\n\t"
> + "cmpl $65536, %%eax\n\t"
And this whole magic should be documented.
> + "jne 12b\n\t"
> + "lfence\n\t"
> + :
> + : "r" (empty_zero_pages)
> + : "rax", "rbx", "rcx", "rdx");
How is that supposed to compile on 32bit?
> +}
Aside of that do we really need that manual flush thingy? Is that ucode
update going to take forever?
Thanks,
tglx
^ permalink raw reply [flat|nested] 72+ messages in thread
* [MODERATED] Re: [PATCH 1/2] L1TF KVM 1
2018-05-29 22:49 ` [PATCH 1/2] L1TF KVM 1 Thomas Gleixner
@ 2018-05-29 23:54 ` Andrew Cooper
2018-05-30 9:01 ` Paolo Bonzini
2018-05-30 8:55 ` [MODERATED] " Peter Zijlstra
` (2 subsequent siblings)
3 siblings, 1 reply; 72+ messages in thread
From: Andrew Cooper @ 2018-05-29 23:54 UTC (permalink / raw)
To: speck
On 29/05/2018 23:49, speck for Thomas Gleixner wrote:
> On Tue, 29 May 2018, speck for Paolo Bonzini wrote:
>> +static void *__read_mostly empty_zero_pages;
>> +
>> +void kvm_l1d_flush(void)
>> +{
>> + asm volatile(
>> + "movq %0, %%rax\n\t"
>> + "leaq 65536(%0), %%rdx\n\t"
> Why 64K?
>
>> + "11: \n\t"
>> + "movzbl (%%rax), %%ecx\n\t"
>> + "addq $4096, %%rax\n\t"
>> + "cmpq %%rax, %%rdx\n\t"
>> + "jne 11b\n\t"
>> + "xorl %%eax, %%eax\n\t"
>> + "cpuid\n\t"
> What's the cpuid invocation for?
>
>> + "xorl %%eax, %%eax\n\t"
>> + "12:\n\t"
>> + "movzwl %%ax, %%edx\n\t"
>> + "addl $64, %%eax\n\t"
>> + "movzbl (%%rdx, %0), %%ecx\n\t"
>> + "cmpl $65536, %%eax\n\t"
> And this whole magic should be documented.
>
>> + "jne 12b\n\t"
>> + "lfence\n\t"
>> + :
>> + : "r" (empty_zero_pages)
>> + : "rax", "rbx", "rcx", "rdx");
> How is that supposed to compile on 32bit?
>
>> +}
> Aside of that do we really need that manual flush thingy? Is that ucode
> update going to take forever?
I already provided feedback on this software loop, but have had radio
silence as a result. The CPUID is for serialisation (as best I can guess),
to terminate any WC buffer which got hit, but this is going to truly
suck inside a VM. If it is for full serialisation properties, the least
overhead option would be a write to %cr2, which is serialising on all
affected parts.
Other bits I don't understand are the 64k limit in the first place, why
it gets walked over in 4k strides to begin with (I'm not aware of any
prefetching which would benefit that...) and why a particularly
obfuscated piece of magic is used for the 64byte strides.
~Andrew
* [MODERATED] Re: [PATCH 2/2] L1TF KVM 2
[not found] ` <20180529194322.8B56F610F8@crypto-ml.lab.linutronix.de>
@ 2018-05-29 23:59 ` Andrew Cooper
2018-05-30 8:38 ` Thomas Gleixner
0 siblings, 1 reply; 72+ messages in thread
From: Andrew Cooper @ 2018-05-29 23:59 UTC (permalink / raw)
To: speck
On 29/05/2018 20:42, speck for Paolo Bonzini wrote:
> From: Paolo Bonzini <pbonzini@redhat.com>
> Subject: [PATCH 2/2] kvm: x86: mitigation for L1 cache terminal fault vulnerabilities
>
> Intel's new microcode adds a new feature bit in CPUID[7,0].EDX[28].
> If it is active, the displacement flush is unnecessary. Tested on
> a Coffee Lake machine.
>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
> arch/x86/include/asm/cpufeatures.h | 1 +
> arch/x86/include/asm/msr-index.h | 3 +++
> arch/x86/kvm/x86.c | 4 ++++
> 3 files changed, 8 insertions(+)
>
> diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
> index 578793e97431..aebf89c4175d 100644
> --- a/arch/x86/include/asm/cpufeatures.h
> +++ b/arch/x86/include/asm/cpufeatures.h
> @@ -333,6 +333,7 @@
> #define X86_FEATURE_PCONFIG (18*32+18) /* Intel PCONFIG */
> #define X86_FEATURE_SPEC_CTRL (18*32+26) /* "" Speculation Control (IBRS + IBPB) */
> #define X86_FEATURE_INTEL_STIBP (18*32+27) /* "" Single Thread Indirect Branch Predictors */
> +#define X86_FEATURE_FLUSH_L1D (18*32+28) /* IA32_FLUSH_L1D MSR */
> #define X86_FEATURE_ARCH_CAPABILITIES (18*32+29) /* IA32_ARCH_CAPABILITIES MSR (Intel) */
>
> /*
> diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
> index 53d5b1b9255e..f43bd9f23053 100644
> --- a/arch/x86/include/asm/msr-index.h
> +++ b/arch/x86/include/asm/msr-index.h
> @@ -65,6 +65,9 @@
>
> #define MSR_MTRRcap 0x000000fe
>
> +#define MSR_IA32_FLUSH_L1D 0x10b
> +#define MSR_IA32_FLUSH_L1D_VALUE 0x00000001
> +
> #define MSR_IA32_ARCH_CAPABILITIES 0x0000010a
> #define ARCH_CAP_RDCL_NO (1 << 0) /* Not susceptible to Meltdown */
> #define ARCH_CAP_IBRS_ALL (1 << 1) /* Enhanced IBRS support */
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index ada9e55fc871..43738283aa2a 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -6518,6 +6518,10 @@ static int pvclock_gtod_notify(struct notifier_block *nb, unsigned long unused,
>
> void kvm_l1d_flush(void)
> {
> + if (static_cpu_has(X86_FEATURE_FLUSH_L1D)) {
> + wrmsrl(MSR_IA32_FLUSH_L1D, MSR_IA32_FLUSH_L1D_VALUE);
> + return;
> + }
What is this supposed to achieve? Sure, most of the cache content has
disappeared, but some of the most interesting parts are still left
available to the guest.
In Xen, I've come to the conclusion that the only viable option here is
for a guest load-only MSR entry. Without this, the GPRs at the most
recent vmentry are still available in the cache (as there is no way for
the hypervisor to rationally zero them), which results in a guest kernel
=> guest user leak if a vmentry occurs late in the guest kernel's
return-to-userspace path.
A guest kernel which is implementing the PTE mitigation is immune to
this attack, but the hypervisor does not know a priori whether this is
the case or not.
~Andrew
* Re: [PATCH 2/2] L1TF KVM 2
2018-05-29 23:59 ` [MODERATED] Re: [PATCH 2/2] L1TF KVM 2 Andrew Cooper
@ 2018-05-30 8:38 ` Thomas Gleixner
2018-05-30 16:57 ` [MODERATED] " Andrew Cooper
0 siblings, 1 reply; 72+ messages in thread
From: Thomas Gleixner @ 2018-05-30 8:38 UTC (permalink / raw)
To: speck
On Wed, 30 May 2018, speck for Andrew Cooper wrote:
> On 29/05/2018 20:42, speck for Paolo Bonzini wrote:
> In Xen, I've come to the conclusion that the only viable option here is
> for a guest load-only MSR entry. Without this, the GPRs at the most
> recent vmentry are still available in the cache (as there is no way for
> the hypervisor to rationally zero them), which results in a guest kernel
> => guest user leak if a vmentry occurs late in the guest kernel's
> return-to-userspace path.
>
> A guest kernel which is implementing the PTE mitigation is immune to
> this attack, but the hypervisor does not know a priori whether this is
> the case or not.
And why would you care about some outdated guest kernel, which is
vulnerable against lots of other stuff as well?
Thanks,
tglx
* [MODERATED] Re: [PATCH 1/2] L1TF KVM 1
2018-05-29 22:49 ` [PATCH 1/2] L1TF KVM 1 Thomas Gleixner
2018-05-29 23:54 ` [MODERATED] " Andrew Cooper
@ 2018-05-30 8:55 ` Peter Zijlstra
2018-05-30 9:02 ` Paolo Bonzini
2018-05-31 19:00 ` Jon Masters
3 siblings, 0 replies; 72+ messages in thread
From: Peter Zijlstra @ 2018-05-30 8:55 UTC (permalink / raw)
To: speck
On Wed, May 30, 2018 at 12:49:55AM +0200, speck for Thomas Gleixner wrote:
> > + case EXIT_REASON_EXTERNAL_INTERRUPT: {
> > + int cpu = raw_smp_processor_id();
> > + int vector = per_cpu(last_vector, cpu);
> > + return vector == LOCAL_TIMER_VECTOR || vector == RESCHEDULE_VECTOR;
>
> That wants a comment why these two are considered safe.
>
> The timer vector is a doubtful one. It does not necessarily cause a
> reschedule and it can run arbitrary softirq code on interrupt return. I
> wouldn't be that sure that it's safe.
reschedule can also cause softirq to run.
And there's just no way we can guarantee softirq doesn't do something
sensitive, like IPsec processing or whatever.
* [MODERATED] Re: [PATCH 1/2] L1TF KVM 1
2018-05-29 23:54 ` [MODERATED] " Andrew Cooper
@ 2018-05-30 9:01 ` Paolo Bonzini
2018-05-30 11:58 ` Martin Pohlack
2018-06-04 8:24 ` [MODERATED] " Martin Pohlack
0 siblings, 2 replies; 72+ messages in thread
From: Paolo Bonzini @ 2018-05-30 9:01 UTC (permalink / raw)
To: speck
On 30/05/2018 01:54, speck for Andrew Cooper wrote:
> Other bits I don't understand are the 64k limit in the first place, why
> it gets walked over in 4k strides to begin with (I'm not aware of any
> prefetching which would benefit that...) and why a particularly
> obfuscated piece of magic is used for the 64byte strides.
That is the only part I understood, :) the 4k strides ensure that the
source data is in the TLB. Why that is needed is still a mystery though.
Paolo
* [MODERATED] Re: [PATCH 1/2] L1TF KVM 1
2018-05-29 22:49 ` [PATCH 1/2] L1TF KVM 1 Thomas Gleixner
2018-05-29 23:54 ` [MODERATED] " Andrew Cooper
2018-05-30 8:55 ` [MODERATED] " Peter Zijlstra
@ 2018-05-30 9:02 ` Paolo Bonzini
2018-05-31 19:00 ` Jon Masters
3 siblings, 0 replies; 72+ messages in thread
From: Paolo Bonzini @ 2018-05-30 9:02 UTC (permalink / raw)
To: speck
On 30/05/2018 00:49, speck for Thomas Gleixner wrote:
> On Tue, 29 May 2018, speck for Paolo Bonzini wrote:
>> From: Paolo Bonzini <pbonzini@redhat.com>
>> Subject: [PATCH 1/2] kvm: x86: mitigation for L1 cache terminal fault vulnerabilities
>>
>> This patch adds two mitigation modes for CVE-2018-3620, aka L1 terminal
>> fault. The two modes are "vmexit_l1d_flush=1" and "vmexit_l1d_flush=2".
>
> This is confusing at best. Why is this vmexit_l1d_flush? You flush on
> VMENTER not on VMEXIT.
Yes, that makes sense.
>> Notably, L1 cache flushes are performed on EPT violations (which are
>> basically KVM-level page faults), vmexits that involve the emulator,
>> and on every KVM_RUN invocation (so each userspace exit). However,
>> most vmexits are considered safe. I singled out the emulator because
>> it may be a good target for other speculative execution-based threats,
>> and the MMU because it can bring host page tables in the L1 cache.
>
> What about interrupts?
Will fix the commit message and rework the patch to set vcpu_unconfined
at vmexit time.
>> + /*
>> + * The next three set vcpu->arch.vcpu_unconfined themselves, so
>> + * we consider them confined here.
>
> What's the logic behind that?
>
>> + */
>> + case EXIT_REASON_EPT_VIOLATION:
>> + case EXIT_REASON_EPT_MISCONFIG:
>> + case EXIT_REASON_IO_INSTRUCTION:
>> + return true;
EPT misconfig and I/O instruction are very frequent, and most of the
time they run just a small fast path. EPT violation can be put together
with the "always slow" ones.
>> + case EXIT_REASON_EXTERNAL_INTERRUPT: {
>> + int cpu = raw_smp_processor_id();
>> + int vector = per_cpu(last_vector, cpu);
>> + return vector == LOCAL_TIMER_VECTOR || vector == RESCHEDULE_VECTOR;
>
> That wants a comment why these two are considered safe.
>
> The timer vector is a doubtful one. It does not necessarily cause a
> reschedule and it can run arbitrary softirq code on interrupt return. I
> wouldn't be that sure that it's safe.
It's also the most frequent one. :( But I see your and Peter's point,
I'll drop it and consider all external interrupts to be unconfined.
Maybe there could be some kind of "ran softirq" generation count.
>> @@ -9457,11 +9515,13 @@ static void vmx_handle_external_intr(struct kvm_vcpu *vcpu)
>> unsigned long entry;
>> gate_desc *desc;
>> struct vcpu_vmx *vmx = to_vmx(vcpu);
>> + int cpu = raw_smp_processor_id();
>> #ifdef CONFIG_X86_64
>> unsigned long tmp;
>> #endif
>>
>> vector = exit_intr_info & INTR_INFO_VECTOR_MASK;
>> + per_cpu(last_vector, cpu) = vector;
>
> Why aren't you doing the evaluation of the vector right here and set the
> unconfined bit instead of having yet another indirection and probably
> another cache line for that per_cpu() storage? That does not make any
> sense at all.
Because that's how the patches I got from Intel did it, and I kind of
liked keeping all the logic in one function. But it's going to go away,
it's not safe.
>> desc = (gate_desc *)vmx->host_idt_base + vector;
>> entry = gate_offset(desc);
>
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -6509,10 +6512,40 @@ static int pvclock_gtod_notify(struct notifier_block *nb, unsigned long unused,
>> };
>> #endif
>>
>> +
>> +#define L1D_CACHE_ORDER 3
>
> This should use the cache size information and not a hard coded value I think.
Plus it seems wrong. Order 3 is 32K, not 64K, isn't it? :/
> And this whole magic should be documented.
>
> Aside of that do we really need that manual flush thingy? Is that ucode
> update going to take forever?
As Andrew said, this was also copied from Intel's patch, assuming they
knew what they were doing. I'll just drop it and just use the microcode
MSR.
Paolo
* [MODERATED] Re: [PATCH 1/2] L1TF KVM 1
2018-05-30 9:01 ` Paolo Bonzini
@ 2018-05-30 11:58 ` Martin Pohlack
2018-05-30 12:25 ` Thomas Gleixner
2018-06-04 8:24 ` [MODERATED] " Martin Pohlack
1 sibling, 1 reply; 72+ messages in thread
From: Martin Pohlack @ 2018-05-30 11:58 UTC (permalink / raw)
To: speck
On 30.05.2018 11:01, speck for Paolo Bonzini wrote:
> On 30/05/2018 01:54, speck for Andrew Cooper wrote:
>> Other bits I don't understand are the 64k limit in the first place, why
>> it gets walked over in 4k strides to begin with (I'm not aware of any
>> prefetching which would benefit that...) and why a particularly
>> obfuscated piece of magic is used for the 64byte strides.
>
> That is the only part I understood, :) the 4k strides ensure that the
> source data is in the TLB. Why that is needed is still a mystery though.
I think the reasoning is that you first want to populate the TLB for the
whole flush array, then fence, to make sure TLB walks do not interfere
with the actual flushing later, either for performance reasons or for
preventing leakage of partial walk results.
Not sure about the 64K; it is likely about the LRU implementation for L1
replacement not being perfect (it is pseudo-LRU), so you need to flush
more than the L1 size (32K) in software. But I have also seen smaller
recommendations for that (52K).
Martin
* Re: [PATCH 1/2] L1TF KVM 1
2018-05-30 11:58 ` Martin Pohlack
@ 2018-05-30 12:25 ` Thomas Gleixner
2018-05-30 14:31 ` Thomas Gleixner
0 siblings, 1 reply; 72+ messages in thread
From: Thomas Gleixner @ 2018-05-30 12:25 UTC (permalink / raw)
To: speck
On Wed, 30 May 2018, speck for Martin Pohlack wrote:
> -----BEGIN PGP MESSAGE-----
> Charset: windows-1252
> Version: GnuPG v2
Sorry the remailer failed to decrypt that message. Investigating.
* Re: [PATCH 1/2] L1TF KVM 1
2018-05-30 12:25 ` Thomas Gleixner
@ 2018-05-30 14:31 ` Thomas Gleixner
0 siblings, 0 replies; 72+ messages in thread
From: Thomas Gleixner @ 2018-05-30 14:31 UTC (permalink / raw)
To: speck
On Wed, 30 May 2018, speck for Thomas Gleixner wrote:
> On Wed, 30 May 2018, speck for Martin Pohlack wrote:
>
> > -----BEGIN PGP MESSAGE-----
> > Charset: windows-1252
> > Version: GnuPG v2
>
> Sorry the remailer failed to decrypt that message. Investigating.
It was missing a decoding from quoted-printable which is required due to
charset = windows-1252. Fixed and replayed.
Thanks,
tglx
* [MODERATED] Encrypted Message
2018-05-29 21:14 ` L1D-Fault KVM mitigation Thomas Gleixner
@ 2018-05-30 16:38 ` Tim Chen
0 siblings, 0 replies; 72+ messages in thread
From: Tim Chen @ 2018-05-30 16:38 UTC (permalink / raw)
To: speck
From: Tim Chen <tim.c.chen@linux.intel.com>
To: speck for Thomas Gleixner <speck@linutronix.de>
Subject: Re: L1D-Fault KVM mitigation
On 05/29/2018 02:14 PM, speck for Thomas Gleixner wrote:
>
>> yes, with compile workload, the HT speedup was mostly eaten up by
>> overhead.
>
> So where is the point of the exercise?
>
> You will not find a generic solution for this problem ever simply because
> the workloads and guest scenarios are too different. There are clearly
> scenarios which can benefit, but at the same time there are scenarios which
> will be way worse off than with SMT disabled.
>
> I completely understand that Intel wants to avoid the 'disable SMT'
> solution by all means, but this cannot be done with something which is
> obviously creating more problems than it solves in the first place.
>
> At some point reality has to kick in and you have to admit that there is no
> generic solution and the only solution for a lot of use cases will be to
> disable SMT. The solution for special workloads like the fully partitioned
> ones David mentioned does not need the extra mess all over the place
> especially not when there is ucode assist at least to the point which fits
> into the patch space and some of it really should not take a huge amount of
> effort, like the forced sibling vmexit to avoid the whole IPI machinery.
>
Having to sync on VM entry and on VM exit and on interrupt to idle sibling
sucks. Hopefully the ucode guys can come up with something
to provide an option that forces the sibling to vmexit on vmexit,
and on interrupt to idle sibling. This should cut the sync overhead in half.
Then only VM entry needs to be synced should we still want to
do co-scheduling.
Thanks.
Tim
* [MODERATED] Re: [PATCH 2/2] L1TF KVM 2
2018-05-30 8:38 ` Thomas Gleixner
@ 2018-05-30 16:57 ` Andrew Cooper
2018-05-30 19:11 ` Thomas Gleixner
0 siblings, 1 reply; 72+ messages in thread
From: Andrew Cooper @ 2018-05-30 16:57 UTC (permalink / raw)
To: speck
On 30/05/18 09:38, speck for Thomas Gleixner wrote:
> On Wed, 30 May 2018, speck for Andrew Cooper wrote:
>> On 29/05/2018 20:42, speck for Paolo Bonzini wrote:
>> In Xen, I've come to the conclusion that the only viable option here is
>> for a guest load-only MSR entry. Without this, the GPRs at the most
>> recent vmentry are still available in the cache (as there is no way for
> >> the hypervisor to rationally zero them), which results in a guest kernel
> >> => guest user leak if a vmentry occurs late in the guest kernel's
>> return-to-userspace path.
>>
>> A guest kernel which is implementing the PTE mitigation is immune to
>> this attack, but the hypervisor does not know a priori whether this is
>> the case or not.
> And why would you care about some outdated guest kernel, which is
> vulnerable against lots of other stuff as well?
That's a very good point. It need not matter for the guest kernel =>
user leak.
However, to avoid leaking other hypervisor data into guest context, you
need to account for the possibility of interrupts/exceptions late in the
vmentry path, which includes an NMI hitting on the instruction boundary
before vmresume. Failure to account for this case will, most easily,
leak hypervisor GPRs into guest context.
Unlike MSR_SPEC_CTRL (which can be written "shortly before the
iret/vmentry"), issuing the flush before restoring GPRs is ineffective
at preventing leakage, and a write to MSR_FLUSH_CMD cannot be performed
after restoring GPRs, other than with a MSR guest load list entry.
~Andrew
* Re: [PATCH 2/2] L1TF KVM 2
2018-05-30 16:57 ` [MODERATED] " Andrew Cooper
@ 2018-05-30 19:11 ` Thomas Gleixner
2018-05-30 21:10 ` [MODERATED] " Andi Kleen
0 siblings, 1 reply; 72+ messages in thread
From: Thomas Gleixner @ 2018-05-30 19:11 UTC (permalink / raw)
To: speck
On Wed, 30 May 2018, speck for Andrew Cooper wrote:
> On 30/05/18 09:38, speck for Thomas Gleixner wrote:
> > On Wed, 30 May 2018, speck for Andrew Cooper wrote:
> >> On 29/05/2018 20:42, speck for Paolo Bonzini wrote:
> >> In Xen, I've come to the conclusion that the only viable option here is
> >> for a guest load-only MSR entry. Without this, the GPRs at the most
> >> recent vmentry are still available in the cache (as there is no way for
> >> the hypervisor to rationally zero them), which results in a guest kernel
> >> => guest user leak if a vmentry occurs late in the guest kernel's
> >> return-to-userspace path.
> >>
> >> A guest kernel which is implementing the PTE mitigation is immune to
> >> this attack, but the hypervisor does not know a priori whether this is
> >> the case or not.
> > And why would you care about some outdated guest kernel, which is
> > vulnerable against lots of other stuff as well?
>
> That's a very good point. It need not matter for the guest kernel =>
> user leak.
>
> However, to avoid leaking other hypervisor data into guest context, you
> need to account for the possibility of interrupts/exceptions late in the
> vmentry path, which includes an NMI hitting on the instruction boundary
> before vmresume. Failure to account for this case will, most easily,
> leak hypervisor GPRs into guest context.
Right, but you really have to make a judgement whether this leak is useful
and can be orchestrated by the attacker in the guest. If it's just the
theoretical cornercase with no practical attack surface then you really can
leave that as an exercise for ivory tower people.
Thanks,
tglx
* [MODERATED] Re: [PATCH 2/2] L1TF KVM 2
2018-05-30 19:11 ` Thomas Gleixner
@ 2018-05-30 21:10 ` Andi Kleen
2018-05-30 23:19 ` Andrew Cooper
0 siblings, 1 reply; 72+ messages in thread
From: Andi Kleen @ 2018-05-30 21:10 UTC (permalink / raw)
To: speck
> Right, but you really have to make a judgement whether this leak is useful
> and can be orchestrated by the attacker in the guest. If it's just the
> theoretical cornercase with no practical attack surface then you really can
> leave that as an exercise for ivory tower people.
We care about user data and kernel secrets like keys.
NMIs don't have either.
(except for profile NMIs, but they only have the data of what
you just interrupted)
Same for machine checks etc.
In fact most hard interrupts are likely uninteresting (except those
that copy user data), although soft interrupts may not be.
-Andi
* [MODERATED] Re: [PATCH 2/2] L1TF KVM 2
2018-05-30 21:10 ` [MODERATED] " Andi Kleen
@ 2018-05-30 23:19 ` Andrew Cooper
0 siblings, 0 replies; 72+ messages in thread
From: Andrew Cooper @ 2018-05-30 23:19 UTC (permalink / raw)
To: speck
On 30/05/2018 22:10, speck for Andi Kleen wrote:
>> Right, but you really have to make a judgement whether this leak is useful
>> and can be orchestrated by the attacker in the guest. If it's just the
>> theoretical cornercase with no practical attack surface then you really can
>> leave that as an exercise for ivory tower people.
> We care about user data and kernel secrets like keys.
>
> NMIs don't have either.
>
> (except for profile NMIs, but they only have the data of what
> you just interrupted)
>
> Same for machine checks etc.
>
> In fact most hard interrupts are likely uninteresting (except those
> that copy user data), although soft interrupts may not be.
Until we have instructions for how to turn off next-page prefetch
(Haswell and later, I believe), *any* memory access is liable to pull in
unrelated data from adjacent pages. Accesses into the directmap are
liable to pull in data from a completely unrelated context, which in
Xen's case has a reasonable chance of being data from another VM, and in
Linux's case, data from another process.
Even with regular prefetching, you're highly likely to pull in a cache
line or two from an adjacent heap object.
Maybe I am being too pessimistic, but at this point I don't buy the
argument of "architecturally, we don't touch any sensitive data,
therefore it won't be in the cache". Whether said data is usefully
extractable by an attacker is a different matter, but I don't feel
comfortable betting people's data isolation on the expectation that it
isn't extractable.
~Andrew
* [MODERATED] Re: [PATCH 1/2] L1TF KVM 1
2018-05-29 22:49 ` [PATCH 1/2] L1TF KVM 1 Thomas Gleixner
` (2 preceding siblings ...)
2018-05-30 9:02 ` Paolo Bonzini
@ 2018-05-31 19:00 ` Jon Masters
3 siblings, 0 replies; 72+ messages in thread
From: Jon Masters @ 2018-05-31 19:00 UTC (permalink / raw)
To: speck
On 05/29/2018 06:49 PM, speck for Thomas Gleixner wrote:
> On Tue, 29 May 2018, speck for Paolo Bonzini wrote:
>> +void kvm_l1d_flush(void)
>> +{
>> + asm volatile(
>> + "movq %0, %%rax\n\t"
>> + "leaq 65536(%0), %%rdx\n\t"
>
> Why 64K?
>
>> + "11: \n\t"
>> + "movzbl (%%rax), %%ecx\n\t"
>> + "addq $4096, %%rax\n\t"
>> + "cmpq %%rax, %%rdx\n\t"
>> + "jne 11b\n\t"
>> + "xorl %%eax, %%eax\n\t"
>> + "cpuid\n\t"
My guess is they're saying that the maximum L1D$ size is 64K, so they
want to stride it one 4K page at a time to get the prefetchers going ahead
of the next loop... this in theory will make the following loop "faster".
> What's the cpuid invocation for?
>
>> + "xorl %%eax, %%eax\n\t"
>> + "12:\n\t"
>> + "movzwl %%ax, %%edx\n\t"
>> + "addl $64, %%eax\n\t"
>> + "movzbl (%%rdx, %0), %%ecx\n\t"
>> + "cmpl $65536, %%eax\n\t"
...which then tries to do 64 bytes (Intel cache line) at a time.
They use the CPUID as a serializing instruction to ensure the store has
been observed, others have commented on that.
Jon.
--
Computer Architect | Sent from my Fedora powered laptop
* [MODERATED] Re: [PATCH 1/2] L1TF KVM 1
[not found] ` <20180529194239.768D561107@crypto-ml.lab.linutronix.de>
@ 2018-06-01 16:48 ` Konrad Rzeszutek Wilk
2018-06-04 14:56 ` Paolo Bonzini
0 siblings, 1 reply; 72+ messages in thread
From: Konrad Rzeszutek Wilk @ 2018-06-01 16:48 UTC (permalink / raw)
To: speck
On Tue, May 29, 2018 at 09:42:13PM +0200, speck for Paolo Bonzini wrote:
> From: Paolo Bonzini <pbonzini@redhat.com>
> Subject: [PATCH 1/2] kvm: x86: mitigation for L1 cache terminal fault vulnerabilities
>
> This patch adds two mitigation modes for CVE-2018-3620, aka L1 terminal
> fault. The two modes are "vmexit_l1d_flush=1" and "vmexit_l1d_flush=2".
>
> "vmexit_l1d_flush=2" is simply doing an L1 cache flush on every vmexit.
> "vmexit_l1d_flush=1" is trying to avoid doing so on vmexits that are "confined"
> in the kind of code that they execute. The idea is based on Intel's
> patches, but the final list of "confined" vmexits isn't quite.
s/quite/quite fleshed out/
>
> Notably, L1 cache flushes are performed on EPT violations (which are
> basically KVM-level page faults), vmexits that involve the emulator,
> and on every KVM_RUN invocation (so each userspace exit). However,
> most vmexits are considered safe. I singled out the emulator because
> it may be a good target for other speculative execution-based threats,
> and the MMU because it can bring host page tables in the L1 cache.
>
> The mitigation does not in any way try to do anything about hyperthreading;
> it is possible for a sibling thread to read data from the cache during a
> vmexit, before the host completes the flush, or to read data from the cache
> while a sibling runs. This part of the work is not ready yet.
>
> For now I'm leaving the default to "vmexit_l1d_flush=2", in case we need
> to push out the patches in an emergency embargo break, but I don't think
> it's the best setting. The cost is up to 2.5x more expensive vmexits
> on Haswell processors, and 30% on Coffee Lake (for the latter, this is
> independent of whether microcode or the generic flush code are used).
>
This looks very much like what Jun had. If this was based off Intel code, would
it make sense to give credit to him by saying something along the lines of:
"Ideas and some code based off from Jun .." ?
* [MODERATED] [PATCH 1/2] L1TF KVM 1
2018-05-30 9:01 ` Paolo Bonzini
2018-05-30 11:58 ` Martin Pohlack
@ 2018-06-04 8:24 ` Martin Pohlack
2018-06-04 13:11 ` [MODERATED] Is: Tim, Q to you. Was:Re: " Konrad Rzeszutek Wilk
1 sibling, 1 reply; 72+ messages in thread
From: Martin Pohlack @ 2018-06-04 8:24 UTC (permalink / raw)
To: speck
[resending as new message as the reply seems to have been lost on at
least some mail paths]
On 30.05.2018 11:01, speck for Paolo Bonzini wrote:
> On 30/05/2018 01:54, speck for Andrew Cooper wrote:
>> Other bits I don't understand are the 64k limit in the first place, why
>> it gets walked over in 4k strides to begin with (I'm not aware of any
>> prefetching which would benefit that...) and why a particularly
>> obfuscated piece of magic is used for the 64byte strides.
>
> That is the only part I understood, :) the 4k strides ensure that the
> source data is in the TLB. Why that is needed is still a mystery though.
I think the reasoning is that you first want to populate the TLB for the
whole flush array, then fence, to make sure TLB walks do not interfere
with the actual flushing later, either for performance reasons or for
preventing leakage of partial walk results.
> Not sure about the 64K; it is likely about the LRU implementation for L1
> replacement not being perfect (it is pseudo-LRU), so you need to flush
> more than the L1 size (32K) in software. But I have also seen smaller
> recommendations for that (52K).
Martin
* [MODERATED] Is: Tim, Q to you. Was:Re: [PATCH 1/2] L1TF KVM 1
2018-06-04 8:24 ` [MODERATED] " Martin Pohlack
@ 2018-06-04 13:11 ` Konrad Rzeszutek Wilk
2018-06-04 17:59 ` [MODERATED] Encrypted Message Tim Chen
` (2 more replies)
0 siblings, 3 replies; 72+ messages in thread
From: Konrad Rzeszutek Wilk @ 2018-06-04 13:11 UTC (permalink / raw)
To: speck
On Mon, Jun 04, 2018 at 10:24:59AM +0200, speck for Martin Pohlack wrote:
> [resending as new message as the replay seems to have been lost on at
> least some mail paths]
>
> On 30.05.2018 11:01, speck for Paolo Bonzini wrote:
> > On 30/05/2018 01:54, speck for Andrew Cooper wrote:
> >> Other bits I don't understand are the 64k limit in the first place, why
> >> it gets walked over in 4k strides to begin with (I'm not aware of any
> >> prefetching which would benefit that...) and why a particularly
> >> obfuscated piece of magic is used for the 64byte strides.
> >
> > That is the only part I understood, :) the 4k strides ensure that the
> > source data is in the TLB. Why that is needed is still a mystery though.
>
> I think the reasoning is that you first want to populate the TLB for the
> whole flush array, then fence, to make sure TLB walks do not interfere
> with the actual flushing later, either for performance reasons or for
> preventing leakage of partial walk results.
>
> Not sure about the 64K, it likely is about the LRU implementation for L1
> replacement not being perfect (but pseudo LRU), so you need to flush
> more than the L1 size (32K) in software. But I have also seen smaller
> recommendations for that (52K).
Isn't Tim Chen from Intel on this mailing list? Tim, could you find out
please?
>
> Martin
* [MODERATED] Re: [PATCH 1/2] L1TF KVM 1
2018-06-01 16:48 ` [MODERATED] Re: [PATCH 1/2] L1TF KVM 1 Konrad Rzeszutek Wilk
@ 2018-06-04 14:56 ` Paolo Bonzini
0 siblings, 0 replies; 72+ messages in thread
From: Paolo Bonzini @ 2018-06-04 14:56 UTC (permalink / raw)
To: speck
On 01/06/2018 18:48, speck for Konrad Rzeszutek Wilk wrote:
> On Tue, May 29, 2018 at 09:42:13PM +0200, speck for Paolo Bonzini wrote:
>> From: Paolo Bonzini <pbonzini@redhat.com>
>> Subject: [PATCH 1/2] kvm: x86: mitigation for L1 cache terminal fault vulnerabilities
>>
>> This patch adds two mitigation modes for CVE-2018-3620, aka L1 terminal
>> fault. The two modes are "vmexit_l1d_flush=1" and "vmexit_l1d_flush=2".
>>
>> "vmexit_l1d_flush=2" is simply doing an L1 cache flush on every vmexit.
>> "vmexit_l1d_flush=1" is trying to avoid so on vmexits that are "confined"
>> in the kind of code that they execute. The idea is based on Intel's
>> patches, but the final list of "confined" vmexits isn't quite.
>
> s/quite/quite fleshed out/
No, not quite based on Intel's patches. :)
>>
>> Notably, L1 cache flushes are performed on EPT violations (which are
>> basically KVM-level page faults), vmexits that involve the emulator,
>> and on every KVM_RUN invocation (so each userspace exit). However,
>> most vmexits are considered safe. I singled out the emulator because
>> it may be a good target for other speculative execution-based threats,
>> and the MMU because it can bring host page tables in the L1 cache.
>>
>> The mitigation does not in any way try to do anything about hyperthreading;
>> it is possible for a sibling thread to read data from the cache during a
>> vmexit, before the host completes the flush, or to read data from the cache
>> while a sibling runs. This part of the work is not ready yet.
>>
>> For now I'm leaving the default to "vmexit_l1d_flush=2", in case we need
>> to push out the patches in an emergency embargo break, but I don't think
>> it's the best setting. The cost is up to 2.5x more expensive vmexits
>> on Haswell processors, and 30% on Coffee Lake (for the latter, this is
>> independent of whether microcode or the generic flush code are used).
>>
>
> This looks very much like what Jun had. If this was based off Intel code would
> it make sense to give credit to him by saying something along:
>
> "Ideas and some code based off from Jun .." ?
I was not sure who the author of Intel's code was, so I left in the
references above. Really the only thing I lifted from there was the L1
flush; the "confined" moniker got stuck in my brain, but all the other
code was rewritten from scratch. I'll resend a new version with review
comments applied tomorrow or Wednesday.
Paolo
* [MODERATED] Encrypted Message
2018-06-04 13:11 ` [MODERATED] Is: Tim, Q to you. Was:Re: " Konrad Rzeszutek Wilk
@ 2018-06-04 17:59 ` Tim Chen
2018-06-05 1:25 ` [MODERATED] Re: Is: Tim, Q to you. Was:Re: [PATCH 1/2] L1TF KVM 1 Jon Masters
2018-06-05 23:34 ` [MODERATED] Encrypted Message Tim Chen
2 siblings, 0 replies; 72+ messages in thread
From: Tim Chen @ 2018-06-04 17:59 UTC (permalink / raw)
To: speck
From: Tim Chen <tim.c.chen@linux.intel.com>
To: speck for Konrad Rzeszutek Wilk <speck@linutronix.de>
Subject: Re: Is: Tim, Q to you. Was:Re: [PATCH 1/2] L1TF KVM 1
On 06/04/2018 06:11 AM, speck for Konrad Rzeszutek Wilk wrote:
> On Mon, Jun 04, 2018 at 10:24:59AM +0200, speck for Martin Pohlack wrote:
>> [resending as new message as the replay seems to have been lost on at
>> least some mail paths]
>>
>> On 30.05.2018 11:01, speck for Paolo Bonzini wrote:
>>> On 30/05/2018 01:54, speck for Andrew Cooper wrote:
>>>> Other bits I don't understand are the 64k limit in the first place, why
>>>> it gets walked over in 4k strides to begin with (I'm not aware of any
>>>> prefetching which would benefit that...) and why a particularly
>>>> obfuscated piece of magic is used for the 64byte strides.
>>>
>>> That is the only part I understood, :) the 4k strides ensure that the
>>> source data is in the TLB. Why that is needed is still a mystery though.
>>
>> I think the reasoning is that you first want to populate the TLB for the
>> whole flush array, then fence, to make sure TLB walks do not interfere
>> with the actual flushing later, either for performance reasons or for
>> preventing leakage of partial walk results.
>>
>> Not sure about the 64K, it likely is about the LRU implementation for L1
>> replacement not being perfect (but pseudo LRU), so you need to flush
>> more than the L1 size (32K) in software. But I have also seen smaller
>> recommendations for that (52K).
>
> Isn't Tim Chen from Intel on this mailing list? Tim, could you find out
> please?
>
Will do.
Tim
* [MODERATED] Re: Is: Tim, Q to you. Was:Re: [PATCH 1/2] L1TF KVM 1
2018-06-04 13:11 ` [MODERATED] Is: Tim, Q to you. Was:Re: " Konrad Rzeszutek Wilk
2018-06-04 17:59 ` [MODERATED] Encrypted Message Tim Chen
@ 2018-06-05 1:25 ` Jon Masters
2018-06-05 1:30 ` Linus Torvalds
2018-06-05 7:10 ` Martin Pohlack
2018-06-05 23:34 ` [MODERATED] Encrypted Message Tim Chen
2 siblings, 2 replies; 72+ messages in thread
From: Jon Masters @ 2018-06-05 1:25 UTC (permalink / raw)
To: speck
On 06/04/2018 09:11 AM, speck for Konrad Rzeszutek Wilk wrote:
> On Mon, Jun 04, 2018 at 10:24:59AM +0200, speck for Martin Pohlack wrote:
>> [resending as new message as the replay seems to have been lost on at
>> least some mail paths]
>>
>> On 30.05.2018 11:01, speck for Paolo Bonzini wrote:
>>> On 30/05/2018 01:54, speck for Andrew Cooper wrote:
>>>> Other bits I don't understand are the 64k limit in the first place, why
>>>> it gets walked over in 4k strides to begin with (I'm not aware of any
>>>> prefetching which would benefit that...) and why a particularly
>>>> obfuscated piece of magic is used for the 64byte strides.
>>>
>>> That is the only part I understood, :) the 4k strides ensure that the
>>> source data is in the TLB. Why that is needed is still a mystery though.
>>
>> I think the reasoning is that you first want to populate the TLB for the
>> whole flush array, then fence, to make sure TLB walks do not interfere
>> with the actual flushing later, either for performance reasons or for
>> preventing leakage of partial walk results.
>>
>> Not sure about the 64K, it likely is about the LRU implementation for L1
>> replacement not being perfect (but pseudo LRU), so you need to flush
>> more than the L1 size (32K) in software. But I have also seen smaller
>> recommendations for that (52K).
>
> Isn't Tim Chen from Intel on this mailing list? Tim, could you find out
> please?
I had assumed it was for the more straightforward reason that $future
uarch has a 64K L1D$, at least according to "The Internet" (TM):
https://en.wikichip.org/wiki/intel/core_i3/i3-8121u
It ought to be associativity that impacts the displacement itself, not
the LRU policy since you still end up with the L1D being updated.
Jon.
--
Computer Architect | Sent from my Fedora powered laptop
* [MODERATED] Re: Is: Tim, Q to you. Was:Re: [PATCH 1/2] L1TF KVM 1
2018-06-05 1:25 ` [MODERATED] Re: Is: Tim, Q to you. Was:Re: [PATCH 1/2] L1TF KVM 1 Jon Masters
@ 2018-06-05 1:30 ` Linus Torvalds
2018-06-05 7:10 ` Martin Pohlack
1 sibling, 0 replies; 72+ messages in thread
From: Linus Torvalds @ 2018-06-05 1:30 UTC (permalink / raw)
To: speck
On Mon, 4 Jun 2018, speck for Jon Masters wrote:
>
> I had assumed it was for the more straightforward reason that $future
> uarch has a 64K L1D$, at least according to "The Internet" (TM):
>
> https://en.wikichip.org/wiki/intel/core_i3/i3-8121u
No, that 64kB is just 2x32kB due to two cores.
It's still 8-way and 32kB per core, from what I can tell.
Linus
* [MODERATED] Re: Is: Tim, Q to you. Was:Re: [PATCH 1/2] L1TF KVM 1
2018-06-05 1:25 ` [MODERATED] Re: Is: Tim, Q to you. Was:Re: [PATCH 1/2] L1TF KVM 1 Jon Masters
2018-06-05 1:30 ` Linus Torvalds
@ 2018-06-05 7:10 ` Martin Pohlack
1 sibling, 0 replies; 72+ messages in thread
From: Martin Pohlack @ 2018-06-05 7:10 UTC (permalink / raw)
To: speck
On 05.06.2018 03:25, speck for Jon Masters wrote:
> On 06/04/2018 09:11 AM, speck for Konrad Rzeszutek Wilk wrote:
>> On Mon, Jun 04, 2018 at 10:24:59AM +0200, speck for Martin Pohlack wrote:
>>> [resending as new message as the replay seems to have been lost on at
>>> least some mail paths]
>>>
>>> On 30.05.2018 11:01, speck for Paolo Bonzini wrote:
>>>> On 30/05/2018 01:54, speck for Andrew Cooper wrote:
>>>>> Other bits I don't understand are the 64k limit in the first place, why
>>>>> it gets walked over in 4k strides to begin with (I'm not aware of any
>>>>> prefetching which would benefit that...) and why a particularly
>>>>> obfuscated piece of magic is used for the 64byte strides.
>>>>
>>>> That is the only part I understood, :) the 4k strides ensure that the
>>>> source data is in the TLB. Why that is needed is still a mystery though.
>>>
>>> I think the reasoning is that you first want to populate the TLB for the
>>> whole flush array, then fence, to make sure TLB walks do not interfere
>>> with the actual flushing later, either for performance reasons or for
>>> preventing leakage of partial walk results.
>>>
>>> Not sure about the 64K, it likely is about the LRU implementation for L1
>>> replacement not being perfect (but pseudo LRU), so you need to flush
>>> more than the L1 size (32K) in software. But I have also seen smaller
>>> recommendations for that (52K).
>>
>> Isn't Tim Chen from Intel on this mailing list? Tim, could you find out
>> please?
>
> I had assumed it was for the more straightforward reason that $future
> uarch has a 64K L1D$, at least according to "The Internet" (TM):
>
> https://en.wikichip.org/wiki/intel/core_i3/i3-8121u
>
> It ought to be associativity that impacts the displacement itself, not
> the LRU policy since you still end up with the L1D being updated.
Associativity should already be well covered as the flush array is laid
out contiguously, so reading 32K should be enough assuming a perfect LRU
implementation and no interference from the page table walker.
Martin
* [MODERATED] Encrypted Message
2018-06-04 13:11 ` [MODERATED] Is: Tim, Q to you. Was:Re: " Konrad Rzeszutek Wilk
2018-06-04 17:59 ` [MODERATED] Encrypted Message Tim Chen
2018-06-05 1:25 ` [MODERATED] Re: Is: Tim, Q to you. Was:Re: [PATCH 1/2] L1TF KVM 1 Jon Masters
@ 2018-06-05 23:34 ` Tim Chen
2018-06-05 23:37 ` Tim Chen
2 siblings, 1 reply; 72+ messages in thread
From: Tim Chen @ 2018-06-05 23:34 UTC (permalink / raw)
To: speck
From: Tim Chen <tim.c.chen@linux.intel.com>
To: speck for Konrad Rzeszutek Wilk <speck@linutronix.de>
Subject: Re: Is: Tim, Q to you. Was:Re: [PATCH 1/2] L1TF KVM 1
On 06/04/2018 06:11 AM, speck for Konrad Rzeszutek Wilk wrote:
> On Mon, Jun 04, 2018 at 10:24:59AM +0200, speck for Martin Pohlack wrote:
>> [resending as new message as the replay seems to have been lost on at
>> least some mail paths]
>>
>> On 30.05.2018 11:01, speck for Paolo Bonzini wrote:
>>> On 30/05/2018 01:54, speck for Andrew Cooper wrote:
>>>> Other bits I don't understand are the 64k limit in the first place, why
>>>> it gets walked over in 4k strides to begin with (I'm not aware of any
>>>> prefetching which would benefit that...) and why a particularly
>>>> obfuscated piece of magic is used for the 64byte strides.
>>>
>>> That is the only part I understood, :) the 4k strides ensure that the
>>> source data is in the TLB. Why that is needed is still a mystery though.
>>
>> I think the reasoning is that you first want to populate the TLB for the
>> whole flush array, then fence, to make sure TLB walks do not interfere
>> with the actual flushing later, either for performance reasons or for
>> preventing leakage of partial walk results.
>>
>> Not sure about the 64K, it likely is about the LRU implementation for L1
>> replacement not being perfect (but pseudo LRU), so you need to flush
>> more than the L1 size (32K) in software. But I have also seen smaller
>> recommendations for that (52K).
>
Had some discussions with other Intel folks.
Our recommendation is not to use the software sequence for L1 clear but
use wrmsrl(MSR_IA32_FLUSH_L1D, MSR_IA32_FLUSH_L1D_VALUE).
We expect that all affected systems will be receiving a ucode update
to provide L1 clearing capability.
Yes, the 4k stride is for getting TLB walks out of the way and
the 64kB replacement is to accommodate pseudo LRU.
Thanks.
Tim
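The preference Tim states could be expressed as a selection step like the sketch below. Everything here is hypothetical naming on my part; the only concrete detail from the mail itself is the recommended `wrmsrl(MSR_IA32_FLUSH_L1D, MSR_IA32_FLUSH_L1D_VALUE)` write.

```c
/* Illustrative selection of the L1D clearing method: prefer the
 * microcode-assisted MSR write when the ucode advertises it; the
 * software displacement sequence is at most a fallback (and per later
 * discussion in this thread may be dropped entirely). The enum and
 * function names are invented for this sketch. */
enum l1d_flush_method { L1D_FLUSH_NONE, L1D_FLUSH_SW, L1D_FLUSH_MSR };

static enum l1d_flush_method pick_l1d_flush(int ucode_has_flush_msr,
                                            int allow_sw_fallback)
{
    if (ucode_has_flush_msr)
        return L1D_FLUSH_MSR; /* wrmsrl(MSR_IA32_FLUSH_L1D, ...) */
    return allow_sw_fallback ? L1D_FLUSH_SW : L1D_FLUSH_NONE;
}
```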
* [MODERATED] Encrypted Message
2018-06-05 23:34 ` [MODERATED] Encrypted Message Tim Chen
@ 2018-06-05 23:37 ` Tim Chen
2018-06-07 19:11 ` Tim Chen
0 siblings, 1 reply; 72+ messages in thread
From: Tim Chen @ 2018-06-05 23:37 UTC (permalink / raw)
To: speck
From: Tim Chen <tim.c.chen@linux.intel.com>
To: speck for Konrad Rzeszutek Wilk <speck@linutronix.de>
Subject: Re: Is: Tim, Q to you. Was:Re: [PATCH 1/2] L1TF KVM 1
On 06/05/2018 04:34 PM, Tim Chen wrote:
> On 06/04/2018 06:11 AM, speck for Konrad Rzeszutek Wilk wrote:
>> On Mon, Jun 04, 2018 at 10:24:59AM +0200, speck for Martin Pohlack wrote:
>>> [resending as new message as the replay seems to have been lost on at
>>> least some mail paths]
>>>
>>> On 30.05.2018 11:01, speck for Paolo Bonzini wrote:
>>>> On 30/05/2018 01:54, speck for Andrew Cooper wrote:
>>>>> Other bits I don't understand are the 64k limit in the first place, why
>>>>> it gets walked over in 4k strides to begin with (I'm not aware of any
>>>>> prefetching which would benefit that...) and why a particularly
>>>>> obfuscated piece of magic is used for the 64byte strides.
>>>>
>>>> That is the only part I understood, :) the 4k strides ensure that the
>>>> source data is in the TLB. Why that is needed is still a mystery though.
>>>
>>> I think the reasoning is that you first want to populate the TLB for the
>>> whole flush array, then fence, to make sure TLB walks do not interfere
>>> with the actual flushing later, either for performance reasons or for
>>> preventing leakage of partial walk results.
>>>
>>> Not sure about the 64K, it likely is about the LRU implementation for L1
>>> replacement not being perfect (but pseudo LRU), so you need to flush
>>> more than the L1 size (32K) in software. But I have also seen smaller
>>> recommendations for that (52K).
>>
>
> Had some discussions with other Intel folks.
>
> Our recommendation is not to use the software sequence for L1 clear but
> use wrmsrl(MSR_IA32_FLUSH_L1D, MSR_IA32_FLUSH_L1D_VALUE).
> We expect that all affected systems will be receiving a ucode update
> to provide L1 clearing capability.
>
> Yes, the 4k stride is for getting TLB walks out of the way and
> the 64kB replacement is to accommodate pseudo LRU.
I will try to see if I can get hold of the relevant documentation
on pseudo LRU.
Tim
* [MODERATED] Re: [PATCH 1/2] L1TF KVM 1
[not found] ` <20180529194236.EDB8561100@crypto-ml.lab.linutronix.de>
@ 2018-06-06 0:34 ` Dave Hansen
2018-06-06 14:15 ` Dave Hansen
0 siblings, 1 reply; 72+ messages in thread
From: Dave Hansen @ 2018-06-06 0:34 UTC (permalink / raw)
To: speck
On 05/29/2018 12:42 PM, speck for Paolo Bonzini wrote:
> r = -ENOMEM;
> + page = alloc_pages(GFP_ATOMIC, L1D_CACHE_ORDER);
> + if (!page)
> + goto out;
> + empty_zero_pages = page_address(page);
There is also an Intel suggestion to have guard pages before and after
the L1D flush buffer. As it stands, the prefetchers might pull data
into the cache from pages adjacent to the allocation you have there.
You can use vmalloc(), where we get (unmapped) guard pages already. Or,
you can just oversize the allocation using:
alloc_pages_exact((PAGE_SIZE << L1D_CACHE_ORDER) + 2 * PAGE_SIZE, GFP_ATOMIC)
and just point empty_zero_pages to the second page in the buffer:
empty_zero_pages = page_address(page + 1);
I'd suggest the alloc_pages_exact() version. It will chew up fewer TLB
entries.
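The guard-page layout Dave suggests can be pictured as below. The constants and the helper name are illustrative only; in the kernel this would of course sit on top of the real `alloc_pages_exact()` call.

```c
#define PAGE_SZ   4096
#define BUF_PAGES 8   /* flush buffer proper, e.g. PAGE_SIZE << ORDER */

/* Sketch of the oversized allocation: two extra pages surround the
 * flush buffer, so a hardware prefetcher running past either end of
 * the buffer still lands inside our own allocation rather than in an
 * adjacent, unrelated page holding someone else's data. */
static unsigned char *place_flush_buffer(unsigned char *raw)
{
    /* raw[0 .. PAGE_SZ)                         : leading guard page
     * raw[PAGE_SZ .. (BUF_PAGES + 1) * PAGE_SZ) : flush buffer
     * raw[(BUF_PAGES + 1) * PAGE_SZ .. end)     : trailing guard page */
    return raw + PAGE_SZ;
}
```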
* [MODERATED] Re: [PATCH 1/2] L1TF KVM 1
2018-06-06 0:34 ` Dave Hansen
@ 2018-06-06 14:15 ` Dave Hansen
0 siblings, 0 replies; 72+ messages in thread
From: Dave Hansen @ 2018-06-06 14:15 UTC (permalink / raw)
To: speck
> r = -ENOMEM;
> + page = alloc_pages(GFP_ATOMIC, L1D_CACHE_ORDER);
> + if (!page)
> + goto out;
> + empty_zero_pages = page_address(page);
> +
Couple more things:
PAGE_SIZE << L1D_CACHE_ORDER is only 32k, right? The assembly sequence
clears 64k IIRC:
> +void kvm_l1d_flush(void)
> +{
> + asm volatile(
> + "movq %0, %%rax\n\t"
> + "leaq 65536(%0), %%rdx\n\t"
> + "11: \n\t"
...
So we probably want the allocation to be something like:
l1d_clear_buf_size = (PAGE_SIZE << (L1D_CACHE_ORDER + 1)) +
2 * PAGE_SIZE;
alloc_pages_exact(l1d_clear_buf_size, __GFP_ZERO | GFP_ATOMIC);
Note that we need __GFP_ZERO because we could theoretically get a page
that already had kernel secrets in it.
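The sizing arithmetic works out as follows, assuming PAGE_SIZE = 4K and L1D_CACHE_ORDER = 3 (so that PAGE_SIZE shifted by the order gives the 32K Dave mentions); both constants are assumptions for this check, not taken from the patch.

```c
#define ASSUMED_PAGE_SIZE 4096UL
#define ASSUMED_L1D_ORDER 3 /* assumed: 4K << 3 = 32K L1D */

/* Sanity check of the sizing: the asm sequence walks 64K, so the
 * buffer needs order + 1, plus the two guard pages from the earlier
 * mail in this thread. */
static unsigned long l1d_buf_size(void)
{
    return (ASSUMED_PAGE_SIZE << (ASSUMED_L1D_ORDER + 1)) +
           2 * ASSUMED_PAGE_SIZE;
}
```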
* [MODERATED] Encrypted Message
2018-06-05 23:37 ` Tim Chen
@ 2018-06-07 19:11 ` Tim Chen
2018-06-07 23:24 ` [MODERATED] Re: Is: Tim, Q to you. Was:Re: [PATCH 1/2] L1TF KVM 1 Andi Kleen
0 siblings, 1 reply; 72+ messages in thread
From: Tim Chen @ 2018-06-07 19:11 UTC (permalink / raw)
To: speck
From: Tim Chen <tim.c.chen@linux.intel.com>
To: speck for Konrad Rzeszutek Wilk <speck@linutronix.de>
Subject: Re: Is: Tim, Q to you. Was:Re: [PATCH 1/2] L1TF KVM 1
On 06/05/2018 04:37 PM, Tim Chen wrote:
> On 06/05/2018 04:34 PM, Tim Chen wrote:
>> On 06/04/2018 06:11 AM, speck for Konrad Rzeszutek Wilk wrote:
>>> On Mon, Jun 04, 2018 at 10:24:59AM +0200, speck for Martin Pohlack wrote:
>>>> [resending as new message as the replay seems to have been lost on at
>>>> least some mail paths]
>>>>
>>>> On 30.05.2018 11:01, speck for Paolo Bonzini wrote:
>>>>> On 30/05/2018 01:54, speck for Andrew Cooper wrote:
>>>>>> Other bits I don't understand are the 64k limit in the first place, why
>>>>>> it gets walked over in 4k strides to begin with (I'm not aware of any
>>>>>> prefetching which would benefit that...) and why a particularly
>>>>>> obfuscated piece of magic is used for the 64byte strides.
>>>>>
>>>>> That is the only part I understood, :) the 4k strides ensure that the
>>>>> source data is in the TLB. Why that is needed is still a mystery though.
>>>>
>>>> I think the reasoning is that you first want to populate the TLB for the
>>>> whole flush array, then fence, to make sure TLB walks do not interfere
>>>> with the actual flushing later, either for performance reasons or for
>>>> preventing leakage of partial walk results.
>>>>
>>>> Not sure about the 64K, it likely is about the LRU implementation for L1
>>>> replacement not being perfect (but pseudo LRU), so you need to flush
>>>> more than the L1 size (32K) in software. But I have also seen smaller
>>>> recommendations for that (52K).
>>>
>>
>> Had some discussions with other Intel folks.
>>
>> Our recommendation is not to use the software sequence for L1 clear but
>> use wrmsrl(MSR_IA32_FLUSH_L1D, MSR_IA32_FLUSH_L1D_VALUE).
>> We expect that all affected systems will be receiving a ucode update
>> to provide L1 clearing capability.
>>
>> Yes, the 4k stride is for getting TLB walks out of the way and
>> the 64kB replacement is to accommodate pseudo LRU.
>
> I will try to see if I can get hold of the relevant documentation
> on pseudo LRU.
>
The HW folks mentioned that if we have nothing from the flush buffer in
L1, then 32 KB would be sufficient (if we load miss for everything).
However, that's not the case. If some data from the flush buffer is
already in L1, it could protect an unrelated line that's considered
"near" by the LRU from getting flushed. To make sure that does not
happen, we go through 64 KB of data to guarantee every line in L1 will
encounter a load miss and is flushed.
Tim
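In line counts, Tim's explanation works out as follows (assuming 64-byte lines and a 32K L1D, which this thread takes as given):

```c
/* A 32K L1D at 64B per line holds 512 lines; reading 64K touches 1024
 * distinct lines, i.e. 2x the cache. So even when pseudo-LRU lets an
 * already-resident flush-buffer line shield a "near" neighbor, every
 * one of the 512 cache lines still sees at least one load miss.
 * The sizes here are assumptions restated from the discussion above. */
enum {
    L1D_BYTES   = 32 * 1024,
    LINE_BYTES  = 64,
    FLUSH_BYTES = 64 * 1024,
};

static int flush_covers_cache(void)
{
    int cache_lines = L1D_BYTES / LINE_BYTES;   /* 512  */
    int flush_lines = FLUSH_BYTES / LINE_BYTES; /* 1024 */
    return flush_lines >= 2 * cache_lines;
}
```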
* [MODERATED] Re: Is: Tim, Q to you. Was:Re: [PATCH 1/2] L1TF KVM 1
2018-06-07 19:11 ` Tim Chen
@ 2018-06-07 23:24 ` Andi Kleen
2018-06-08 16:29 ` Thomas Gleixner
0 siblings, 1 reply; 72+ messages in thread
From: Andi Kleen @ 2018-06-07 23:24 UTC (permalink / raw)
To: speck
On Thu, Jun 07, 2018 at 12:11:21PM -0700, speck for Tim Chen wrote:
> From: Tim Chen <tim.c.chen@linux.intel.com>
> To: speck for Konrad Rzeszutek Wilk <speck@linutronix.de>
> Subject: Re: Is: Tim, Q to you. Was:Re: [PATCH 1/2] L1TF KVM 1
> On 06/05/2018 04:37 PM, Tim Chen wrote:
> > On 06/05/2018 04:34 PM, Tim Chen wrote:
> >> On 06/04/2018 06:11 AM, speck for Konrad Rzeszutek Wilk wrote:
> >>> On Mon, Jun 04, 2018 at 10:24:59AM +0200, speck for Martin Pohlack wrote:
> >>>> [resending as new message as the replay seems to have been lost on at
> >>>> least some mail paths]
> >>>>
> >>>> On 30.05.2018 11:01, speck for Paolo Bonzini wrote:
> >>>>> On 30/05/2018 01:54, speck for Andrew Cooper wrote:
> >>>>>> Other bits I don't understand are the 64k limit in the first place, why
> >>>>>> it gets walked over in 4k strides to begin with (I'm not aware of any
> >>>>>> prefetching which would benefit that...) and why a particularly
> >>>>>> obfuscated piece of magic is used for the 64byte strides.
> >>>>>
> >>>>> That is the only part I understood, :) the 4k strides ensure that the
> >>>>> source data is in the TLB. Why that is needed is still a mystery though.
> >>>>
> >>>> I think the reasoning is that you first want to populate the TLB for the
> >>>> whole flush array, then fence, to make sure TLB walks do not interfere
> >>>> with the actual flushing later, either for performance reasons or for
> >>>> preventing leakage of partial walk results.
> >>>>
> >>>> Not sure about the 64K, it likely is about the LRU implementation for L1
> >>>> replacement not being perfect (but pseudo LRU), so you need to flush
> >>>> more than the L1 size (32K) in software. But I have also seen smaller
> >>>> recommendations for that (52K).
> >>>
> >>
> >> Had some discussions with other Intel folks.
> >>
> >> Our recommendation is not to use the software sequence for L1 clear but
> >> use wrmsrl(MSR_IA32_FLUSH_L1D, MSR_IA32_FLUSH_L1D_VALUE).
> >> We expect that all affected systems will be receiving a ucode update
> >> to provide L1 clearing capability.
> >>
> >> Yes, the 4k stride is for getting TLB walks out of the way and
> >> the 64kB replacement is to accommodate pseudo LRU.
> >
> > I will try to see if I can get hold of the relevant documentation
> > on pseudo LRU.
> >
>
> The HW folks mentioned that if we have nothing from the flush buffer in
> L1, then 32 KB would be sufficient (if we load miss for everything).
>
> However, that's not the case. If some data from the flush buffer is
> already in L1, it could protect an unrelated line that's considered
> "near" by the LRU from getting flushed. To make sure that does not
> happen, we go through 64 KB of data to guarantee every line in L1 will
> encounter a load miss and is flushed.
Also the recommended mitigation is really to use the MSR write instead
of the magic software sequence. Perhaps it would be best to
just remove the software sequence. Updated microcode is needed in
any case; it doesn't make sense to try to support partially updated systems.
-Andi
* Re: Is: Tim, Q to you. Was:Re: [PATCH 1/2] L1TF KVM 1
2018-06-07 23:24 ` [MODERATED] Re: Is: Tim, Q to you. Was:Re: [PATCH 1/2] L1TF KVM 1 Andi Kleen
@ 2018-06-08 16:29 ` Thomas Gleixner
2018-06-08 17:51 ` [MODERATED] " Josh Poimboeuf
0 siblings, 1 reply; 72+ messages in thread
From: Thomas Gleixner @ 2018-06-08 16:29 UTC (permalink / raw)
To: speck
On Thu, 7 Jun 2018, speck for Andi Kleen wrote:
> Also the recommended mitigation is really to use the MSR write instead
> of the magic software sequence. Perhaps it would be best to
> just remove the software sequence. Updated microcode is needed in
> any case, it doesn't make sense to try to support partially updated systems.
ACK. That makes sense.
Thanks,
tglx
* [MODERATED] Re: [PATCH 1/2] L1TF KVM 1
[not found] ` <20180529194240.5654A61109@crypto-ml.lab.linutronix.de>
@ 2018-06-08 17:49 ` Josh Poimboeuf
2018-06-08 20:49 ` Konrad Rzeszutek Wilk
0 siblings, 1 reply; 72+ messages in thread
From: Josh Poimboeuf @ 2018-06-08 17:49 UTC (permalink / raw)
To: speck
On Tue, May 29, 2018 at 09:42:13PM +0200, speck for Paolo Bonzini wrote:
> From: Paolo Bonzini <pbonzini@redhat.com>
> Subject: [PATCH 1/2] kvm: x86: mitigation for L1 cache terminal fault vulnerabilities
>
> This patch adds two mitigation modes for CVE-2018-3620, aka L1 terminal
> fault. The two modes are "vmexit_l1d_flush=1" and "vmexit_l1d_flush=2".
>
> "vmexit_l1d_flush=2" is simply doing an L1 cache flush on every vmexit.
> "vmexit_l1d_flush=1" is trying to avoid so on vmexits that are "confined"
> in the kind of code that they execute. The idea is based on Intel's
> patches, but the final list of "confined" vmexits isn't quite.
>
> Notably, L1 cache flushes are performed on EPT violations (which are
> basically KVM-level page faults), vmexits that involve the emulator,
> and on every KVM_RUN invocation (so each userspace exit). However,
> most vmexits are considered safe. I singled out the emulator because
> it may be a good target for other speculative execution-based threats,
> and the MMU because it can bring host page tables in the L1 cache.
>
> The mitigation does not in any way try to do anything about hyperthreading;
> it is possible for a sibling thread to read data from the cache during a
> vmexit, before the host completes the flush, or to read data from the cache
> while a sibling runs. This part of the work is not ready yet.
>
> For now I'm leaving the default to "vmexit_l1d_flush=2", in case we need
> to push out the patches in an emergency embargo break, but I don't think
> it's the best setting. The cost is up to 2.5x more expensive vmexits
> on Haswell processors, and 30% on Coffee Lake (for the latter, this is
> independent of whether microcode or the generic flush code are used).
>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
I think we should report the L1 flushing mitigation value (i.e., 0, 1 or
2) in the 'l1tf' vulnerabilities sysfs file.
Also, it would be clearer to name them something like
vmexit_l1d_flush=off
vmexit_l1d_flush=confined
vmexit_l1d_flush=on
etc.
--
Josh
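The suggested names could map onto the existing numeric values roughly like this; a sketch only, since the real module-parameter plumbing in KVM looks different.

```c
#include <string.h>

/* Map Josh's suggested symbolic option names onto the 0/1/2 values
 * from Paolo's patch. Purely illustrative, not the real parser. */
static int parse_vmexit_l1d_flush(const char *s)
{
    if (strcmp(s, "off") == 0)
        return 0; /* never flush the L1D on vmexit */
    if (strcmp(s, "confined") == 0)
        return 1; /* skip the flush for "confined" vmexits */
    if (strcmp(s, "on") == 0)
        return 2; /* flush on every vmexit */
    return -1;    /* unknown value: reject (-EINVAL in-kernel) */
}
```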
* [MODERATED] Re: Is: Tim, Q to you. Was:Re: [PATCH 1/2] L1TF KVM 1
2018-06-08 16:29 ` Thomas Gleixner
@ 2018-06-08 17:51 ` Josh Poimboeuf
2018-06-11 14:50 ` Paolo Bonzini
0 siblings, 1 reply; 72+ messages in thread
From: Josh Poimboeuf @ 2018-06-08 17:51 UTC (permalink / raw)
To: speck
On Fri, Jun 08, 2018 at 06:29:16PM +0200, speck for Thomas Gleixner wrote:
> On Thu, 7 Jun 2018, speck for Andi Kleen wrote:
> > Also the recommended mitigation is really to use the MSR write instead
> > of the magic software sequence. Perhaps it would be best to
> > just remove the software sequence. Updated microcode is needed in
> > any case, it doesn't make sense to try to support partially updated systems.
>
> ACK. That makes sense.
In that case do we need plumbing to expose the L1 cache flush MSR to the
guest, for nested virt support?
--
Josh
* [MODERATED] Re: [PATCH 1/2] L1TF KVM 1
2018-06-08 17:49 ` Josh Poimboeuf
@ 2018-06-08 20:49 ` Konrad Rzeszutek Wilk
2018-06-08 22:13 ` Josh Poimboeuf
0 siblings, 1 reply; 72+ messages in thread
From: Konrad Rzeszutek Wilk @ 2018-06-08 20:49 UTC (permalink / raw)
To: speck
On Fri, Jun 08, 2018 at 12:49:04PM -0500, speck for Josh Poimboeuf wrote:
> On Tue, May 29, 2018 at 09:42:13PM +0200, speck for Paolo Bonzini wrote:
> > From: Paolo Bonzini <pbonzini@redhat.com>
> > Subject: [PATCH 1/2] kvm: x86: mitigation for L1 cache terminal fault vulnerabilities
> >
> > This patch adds two mitigation modes for CVE-2018-3620, aka L1 terminal
> > fault. The two modes are "vmexit_l1d_flush=1" and "vmexit_l1d_flush=2".
> >
> > "vmexit_l1d_flush=2" is simply doing an L1 cache flush on every vmexit.
> > "vmexit_l1d_flush=1" is trying to avoid so on vmexits that are "confined"
> > in the kind of code that they execute. The idea is based on Intel's
> > patches, but the final list of "confined" vmexits isn't quite.
> >
> > Notably, L1 cache flushes are performed on EPT violations (which are
> > basically KVM-level page faults), vmexits that involve the emulator,
> > and on every KVM_RUN invocation (so each userspace exit). However,
> > most vmexits are considered safe. I singled out the emulator because
> > it may be a good target for other speculative execution-based threats,
> > and the MMU because it can bring host page tables in the L1 cache.
> >
> > The mitigation does not in any way try to do anything about hyperthreading;
> > it is possible for a sibling thread to read data from the cache during a
> > vmexit, before the host completes the flush, or to read data from the cache
> > while a sibling runs. This part of the work is not ready yet.
> >
> > For now I'm leaving the default to "vmexit_l1d_flush=2", in case we need
> > to push out the patches in an emergency embargo break, but I don't think
> > it's the best setting. The cost is up to 2.5x more expensive vmexits
> > on Haswell processors, and 30% on Coffee Lake (for the latter, this is
> > independent of whether microcode or the generic flush code are used).
> >
> > Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
>
> I think we should report the L1 flushing mitigation value (i.e., 0, 1 or
> 2) in the 'l1tf' vulnerabilities sysfs file.
But this is KVM module, not the generic code (bugs.c).
Or did you have in mind some sub-registration code so that the KVM module
can register its state of mind with bugs.c reporting?
Like 'L1D flush on VMENTER','IPI on VMEXIT', etc?
^ permalink raw reply [flat|nested] 72+ messages in thread
* [MODERATED] Re: [PATCH 1/2] L1TF KVM 1
2018-06-08 20:49 ` Konrad Rzeszutek Wilk
@ 2018-06-08 22:13 ` Josh Poimboeuf
0 siblings, 0 replies; 72+ messages in thread
From: Josh Poimboeuf @ 2018-06-08 22:13 UTC (permalink / raw)
To: speck
On Fri, Jun 08, 2018 at 04:49:07PM -0400, speck for Konrad Rzeszutek Wilk wrote:
> On Fri, Jun 08, 2018 at 12:49:04PM -0500, speck for Josh Poimboeuf wrote:
> > On Tue, May 29, 2018 at 09:42:13PM +0200, speck for Paolo Bonzini wrote:
> > > From: Paolo Bonzini <pbonzini@redhat.com>
> > > Subject: [PATCH 1/2] kvm: x86: mitigation for L1 cache terminal fault vulnerabilities
> > >
> > > This patch adds two mitigation modes for CVE-2018-3620, aka L1 terminal
> > > fault. The two modes are "vmexit_l1d_flush=1" and "vmexit_l1d_flush=2".
> > >
> > > "vmexit_l1d_flush=2" is simply doing an L1 cache flush on every vmexit.
> > > "vmexit_l1d_flush=1" is trying to avoid doing so on vmexits that are
> > > "confined" in the kind of code that they execute. The idea is based on
> > > Intel's patches, but the final list of "confined" vmexits isn't quite
> > > the same.
> > >
> > > Notably, L1 cache flushes are performed on EPT violations (which are
> > > basically KVM-level page faults), vmexits that involve the emulator,
> > > and on every KVM_RUN invocation (so each userspace exit). However,
> > > most vmexits are considered safe. I singled out the emulator because
> > > it may be a good target for other speculative execution-based threats,
> > > and the MMU because it can bring host page tables in the L1 cache.
> > >
> > > The mitigation does not in any way try to do anything about hyperthreading;
> > > it is possible for a sibling thread to read data from the cache during a
> > > vmexit, before the host completes the flush, or to read data from the cache
> > > while a sibling runs. This part of the work is not ready yet.
> > >
> > > For now I'm leaving the default to "vmexit_l1d_flush=2", in case we need
> > > to push out the patches in an emergency embargo break, but I don't think
> > > it's the best setting. The cost is up to 2.5x more expensive vmexits
> > > on Haswell processors, and 30% on Coffee Lake (for the latter, this is
> > > independent of whether microcode or the generic flush code are used).
> > >
> > > Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> >
> > I think we should report the L1 flushing mitigation value (i.e., 0, 1 or
> > 2) in the 'l1tf' vulnerabilities sysfs file.
>
> But this is KVM module, not the generic code (bugs.c).
>
> Or did you have in mind some sub-registration code so that the KVM module
> can register its state of mind with bugs.c reporting?
>
> Like 'L1D flush on VMENTER','IPI on VMEXIT', etc?
Right, just something as simple as setting a global variable (which
lives in bugs.c) would probably be good enough.
--
Josh
^ permalink raw reply [flat|nested] 72+ messages in thread
* [MODERATED] Re: Is: Tim, Q to you. Was:Re: [PATCH 1/2] L1TF KVM 1
2018-06-08 17:51 ` [MODERATED] " Josh Poimboeuf
@ 2018-06-11 14:50 ` Paolo Bonzini
0 siblings, 0 replies; 72+ messages in thread
From: Paolo Bonzini @ 2018-06-11 14:50 UTC (permalink / raw)
To: speck
[-- Attachment #1: Type: text/plain, Size: 877 bytes --]
On 08/06/2018 19:51, speck for Josh Poimboeuf wrote:
> On Fri, Jun 08, 2018 at 06:29:16PM +0200, speck for Thomas Gleixner wrote:
>> On Thu, 7 Jun 2018, speck for Andi Kleen wrote:
>>> Also the recommended mitigation is really to use the MSR write instead
>>> of the magic software sequence. Perhaps it would be best to
>>> just remove the software sequence. Updated microcode is needed in
>>> any case, it doesn't make sense to try to support partially updated systems.
>> ACK. That makes sense.
> In that case do we need plumbing to expose the L1 cache flush MSR to the
> guest, for nested virt support?
No, because all L2->L1 exits go through L0 first. So nested virt
systems are not vulnerable.
Is there a sanctioned way (in CPUID or elsewhere) to say the system does
not have the bug?
Paolo (back on track after moving to FPU land for a few days)
^ permalink raw reply [flat|nested] 72+ messages in thread
* [MODERATED] Encrypted Message
2018-06-12 17:29 [MODERATED] FYI - Reading uncached memory Jon Masters
@ 2018-06-14 16:59 ` Tim Chen
0 siblings, 0 replies; 72+ messages in thread
From: Tim Chen @ 2018-06-14 16:59 UTC (permalink / raw)
To: speck
[-- Attachment #1: Type: text/rfc822-headers; protected-headers="v1", Size: 135 bytes --]
From: Tim Chen <tim.c.chen@linux.intel.com>
To: speck for Jon Masters <speck@linutronix.de>
Subject: Re: FYI - Reading uncached memory
[-- Attachment #2: Type: text/plain, Size: 586 bytes --]
On 06/12/2018 10:29 AM, speck for Jon Masters wrote:
> FYI, Graz have been able to prove that Intel processors will allow
> speculative reads of /explicitly/ UC memory (e.g. marked in MTRR). I
> believe they actually use the QPI SAD table to determine what memory is
> speculation safe and what memory has side effects (i.e. if it's HA'able
> memory then it's deemed ok to rampantly speculate from it).
>
> Just in case anyone thought UC was safe against attacks.
>
> Jon.
>
Thanks for forwarding the info. Yes, the internal Intel team
is aware of this issue.
Tim
^ permalink raw reply [flat|nested] 72+ messages in thread
* [MODERATED] Encrypted Message
2019-01-12 1:29 ` [MODERATED] [PATCH v4 05/28] MDSv4 10 Andi Kleen
2019-01-14 19:20 ` [MODERATED] " Dave Hansen
@ 2019-01-14 23:39 ` Tim Chen
1 sibling, 0 replies; 72+ messages in thread
From: Tim Chen @ 2019-01-14 23:39 UTC (permalink / raw)
To: speck
[-- Attachment #1: Type: text/rfc822-headers; protected-headers="v1", Size: 130 bytes --]
From: Tim Chen <tim.c.chen@linux.intel.com>
To: speck for Andi Kleen <speck@linutronix.de>
Subject: Re: [PATCH v4 05/28] MDSv4 10
[-- Attachment #2: Type: text/plain, Size: 526 bytes --]
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 50aa2aba69bd..b5a1bd4a1a46 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5980,6 +5980,7 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
>
> #ifdef CONFIG_SCHED_SMT
> DEFINE_STATIC_KEY_FALSE(sched_smt_present);
> +EXPORT_SYMBOL(sched_smt_present);
This export is not needed since sched_smt_present is not used in the patch series.
Only sched_smt_active() is used.
Thanks.
Tim
^ permalink raw reply [flat|nested] 72+ messages in thread
* [MODERATED] Encrypted Message
2019-01-12 1:29 ` [MODERATED] [PATCH v4 10/28] MDSv4 24 Andi Kleen
@ 2019-01-15 1:05 ` Tim Chen
0 siblings, 0 replies; 72+ messages in thread
From: Tim Chen @ 2019-01-15 1:05 UTC (permalink / raw)
To: speck
[-- Attachment #1: Type: text/rfc822-headers; protected-headers="v1", Size: 130 bytes --]
From: Tim Chen <tim.c.chen@linux.intel.com>
To: speck for Andi Kleen <speck@linutronix.de>
Subject: Re: [PATCH v4 10/28] MDSv4 24
[-- Attachment #2: Type: text/plain, Size: 5059 bytes --]
On 1/11/19 5:29 PM, speck for Andi Kleen wrote:
> +Some CPUs can leave read or written data in internal buffers,
> +which then later might be sampled through side effects.
> +For more details see CVE-2018-12126 CVE-2018-12130 CVE-2018-12127
> +
> +This can be avoided by explicitely clearing the CPU state.
s/explicitely/explicitly
> +
> +We trying to avoid leaking data between different processes,
Suggest changing the above phrase to the below:
CPU state clearing prevents leaking data between different processes,
...
> +Basic requirements and assumptions
> +----------------------------------
> +
> +Kernel addresses and kernel temporary data are not sensitive.
> +
> +User data is sensitive, but only for other processes.
> +
> +Kernel data is sensitive when it is cryptographic keys.
s/when it is/when it involves/
> +
> +Guidance for driver/subsystem developers
> +----------------------------------------
> +
> +When you touch user supplied data of *other* processes in system call
> +context add lazy_clear_cpu().
> +
> +For the cases below we care only about data from other processes.
> +Touching non cryptographic data from the current process is always allowed.
> +
> +Touching only pointers to user data is always allowed.
> +
> +When your interrupt does not touch user data directly consider marking
Add a "," between "directly" and "consider"
> +it with IRQF_NO_USER.
> +
> +When your tasklet does not touch user data directly consider marking
Add a "," between "directly" and "consider"
> +it with TASKLET_NO_USER using tasklet_init_flags/or
> +DECLARE_TASKLET*_NOUSER.
> +
> +When your timer does not touch user data mark it with TIMER_NO_USER.
Add a "," between "data" and "mark"
> +If it is a hrtimer mark it with HRTIMER_MODE_NO_USER.
Add a "," between "hrtimer" and "mark"
> +
> +When your irq poll handler does not touch user data, mark it
> +with IRQ_POLL_F_NO_USER through irq_poll_init_flags.
> +
> +For networking code make sure to only touch user data through
Add a "," between "code" and "make"
> +skb_push/put/copy [add more], unless it is data from the current
> +process. If that is not ensured add lazy_clear_cpu or
Add a "," between "ensured" and "add"
> +lazy_clear_cpu_interrupt. When the non skb data access is only in a
> +hardware interrupt controlled by the driver, it can rely on not
> +setting IRQF_NO_USER for that interrupt.
> +
> +Any cryptographic code touching key data should use memzero_explicit
> +or kzfree.
> +
> +If your RCU callback touches user data add lazy_clear_cpu().
> +
> +These steps are currently only needed for code that runs on MDS affected
> +CPUs, which is currently only x86. But might be worth being prepared
> +if other architectures become affected too.
> +
> +Implementation details/assumptions
> +----------------------------------
> +
> +If a system call touches data it is for its own process, so does not
suggest rephrasing to
If a system call touches data of its own process, cpu state does not
> +need to be cleared, because it has already access to it.
> +
> +When context switching we clear data, unless the context switch
> +is inside a process, or from/to idle. We also clear after any
> +context switches from kernel threads.
> +
> +Idle does not have sensitive data, except for in interrupts, which
> +are handled separately.
> +
> +Cryptographic keys inside the kernel should be protected.
> +We assume they use kzfree() or memzero_explicit() to clear
> +state, so these functions trigger a cpu clear.
> +
> +Hard interrupts, tasklets, timers which can run asynchronous are
> +assumed to touch random user data, unless they have been audited, and
> +marked with NO_USER flags.
> +
> +Most interrupt handlers for modern devices should not touch
> +user data because they rely on DMA and only manipulate
> +pointers. This needs auditing to confirm though.
> +
> +For softirqs we assume that if they touch user data they use
Add "," between "data" and "they"
...
> +Technically we would only need to do this if the BPF program
> +contains conditional branches and loads dominated by them, but
> +let's assume that near all do.
s/near/nearly/
> +
> +This could be further optimized by allowing callers that do
> +a lot of individual BPF runs and are sure they don't touch
> +other user's data inbetween to do the clear only once
> +at the beginning.
Suggest breaking the above sentence. It is quite difficult to read.
> We can add such optimizations later based on
> +profile data.
> +
> +Virtualization
> +--------------
> +
> +When entering a guest in KVM we clear to avoid any leakage to a guest.
... we clear CPU state to avoid ....
> +Normally this is done implicitely as part of the L1TF mitigation.
s/implicitely/implicitly/
> +It relies on this being enabled. It also uses the "fast exit"
> +optimization that only clears if an interrupt or context switch
> +happened.
>
^ permalink raw reply [flat|nested] 72+ messages in thread
* [MODERATED] Encrypted Message
2019-01-14 19:20 ` [MODERATED] " Dave Hansen
@ 2019-01-18 7:33 ` Jon Masters
0 siblings, 0 replies; 72+ messages in thread
From: Jon Masters @ 2019-01-18 7:33 UTC (permalink / raw)
To: speck
[-- Attachment #1: Type: text/rfc822-headers; protected-headers="v1", Size: 122 bytes --]
From: Jon Masters <jcm@redhat.com>
To: speck for Dave Hansen <speck@linutronix.de>
Subject: Re: [PATCH v4 05/28] MDSv4 10
[-- Attachment #2: Type: text/plain, Size: 1328 bytes --]
On 1/14/19 2:20 PM, speck for Dave Hansen wrote:
> On 1/11/19 5:29 PM, speck for Andi Kleen wrote:
>> When entering idle the internal state of the current CPU might
>> become visible to the thread sibling because the CPU "frees" some
>> internal resources.
>
> Is there some documentation somewhere about what "idle" means here? It
> looks like MWAIT and HLT certainly count, but is there anything else?
We know power state transitions in addition can cause the peer to
dynamically sleep or wake up. MWAIT was the main example I got out of
Intel for how you'd explicitly cause a thread to be deallocated.
When Andi is talking about "frees" above, he means (for example) the
dynamic allocation/deallocation of store buffer entries as threads come
and go - e.g. in Skylake there are 56 entries in a distributed store
buffer that splits into 2x28. I am not aware of fill buffer behavior
changing as threads come and go, and this isn't documented AFAICS.
I've been wondering whether we want a bit more detail in the docs. I
spent a /lot/ of time last week going through all of Intel's patents in
this area, which really help understand it. If folks feel we could do
with a bit more meaty summary, I can try to suggest something.
Jon.
--
Computer Architect | Sent with my Fedora powered laptop
^ permalink raw reply [flat|nested] 72+ messages in thread
* [MODERATED] Encrypted Message
2019-02-08 10:53 ` [MODERATED] [RFC][PATCH] performance walnuts Peter Zijlstra
@ 2019-02-15 23:45 ` Jon Masters
0 siblings, 0 replies; 72+ messages in thread
From: Jon Masters @ 2019-02-15 23:45 UTC (permalink / raw)
To: speck
[-- Attachment #1: Type: text/rfc822-headers; protected-headers="v1", Size: 132 bytes --]
From: Jon Masters <jcm@redhat.com>
To: speck for Peter Zijlstra <speck@linutronix.de>
Subject: Re: [RFC][PATCH] performance walnuts
[-- Attachment #2: Type: text/plain, Size: 944 bytes --]
On 2/8/19 5:53 AM, speck for Peter Zijlstra wrote:
> +static void intel_set_tfa(struct cpu_hw_events *cpuc, bool on)
> +{
> + u64 val = MSR_TFA_RTM_FORCE_ABORT * on;
> +
> + if (cpuc->tfa_shadow != val) {
> + cpuc->tfa_shadow = val;
> + wrmsrl(MSR_TSX_FORCE_ABORT, val);
> + }
> +}
Ok let me ask a stupid question.
This MSR is exposed on a given core. What's the impact (if any) on
*other* cores that might be using TSX? For example, suppose I'm running
an application using RTM on one core while another application on
another core begins profiling. What impact does the impact of this MSR
write have on other cores? (Architecturally).
I'm assuming the implementation of HLE relies on whatever you're doing
fitting into the local core's cache and you just abort on any snoop,
etc., so it ought to be fairly self-contained, but I want to know.
Jon.
--
Computer Architect | Sent with my Fedora powered laptop
^ permalink raw reply [flat|nested] 72+ messages in thread
* [MODERATED] Encrypted Message
2019-02-19 12:44 [patch 0/8] MDS basics 0 Thomas Gleixner
@ 2019-02-21 16:14 ` Jon Masters
0 siblings, 0 replies; 72+ messages in thread
From: Jon Masters @ 2019-02-21 16:14 UTC (permalink / raw)
To: speck
[-- Attachment #1: Type: text/rfc822-headers; protected-headers="v1", Size: 125 bytes --]
From: Jon Masters <jcm@redhat.com>
To: speck for Thomas Gleixner <speck@linutronix.de>
Subject: Re: [patch 0/8] MDS basics 0
[-- Attachment #2: Type: text/plain, Size: 304 bytes --]
Hi Thomas,
Just a note on testing. I built a few Coffee Lake client systems for Red
Hat using the 8086K anniversary processor for which we have test ucode.
I will build and test these patches and ask the RH perf team to test.
Jon.
--
Computer Architect | Sent with my Fedora powered laptop
^ permalink raw reply [flat|nested] 72+ messages in thread
* [MODERATED] Encrypted Message
2019-02-20 17:10 ` [MODERATED] " mark gross
@ 2019-02-21 19:26 ` Tim Chen
0 siblings, 0 replies; 72+ messages in thread
From: Tim Chen @ 2019-02-21 19:26 UTC (permalink / raw)
To: speck
[-- Attachment #1: Type: text/rfc822-headers; protected-headers="v1", Size: 135 bytes --]
From: Tim Chen <tim.c.chen@linux.intel.com>
To: speck for mark gross <speck@linutronix.de>
Subject: Re: [patch V2 04/10] MDS basics+ 4
[-- Attachment #2: Type: text/plain, Size: 829 bytes --]
On 2/20/19 9:10 AM, speck for mark gross wrote:
>> +
>> + - KGBD
s/KGBD/KGDB
>> +
>> + If the kernel debugger is accessible by an unpriviledged attacker,
>> + then the NMI handler is the least of the problems.
>> +
...
>
> However, if I'm being pedantic, the attacker-not-having-controllability aspect
> of your argument can apply to most aspects of the MDS vulnerability. I think
> that's why its name uses "data sampling". Also, I need to ask the chip heads
> whether this list of NMIs is complete and can be expected to stay that way
> across processor and platform generations.
>
> --mark
>
I don't think any of the code paths listed touches any user data. So even
if an attacker has some means to control NMIs, he won't get any useful data.
Thanks.
Tim
^ permalink raw reply [flat|nested] 72+ messages in thread
* [MODERATED] Encrypted Message
2019-02-21 23:44 ` [patch V3 4/9] MDS basics 4 Thomas Gleixner
@ 2019-02-22 7:45 ` Jon Masters
0 siblings, 0 replies; 72+ messages in thread
From: Jon Masters @ 2019-02-22 7:45 UTC (permalink / raw)
To: speck
[-- Attachment #1: Type: text/rfc822-headers; protected-headers="v1", Size: 128 bytes --]
From: Jon Masters <jcm@redhat.com>
To: speck for Thomas Gleixner <speck@linutronix.de>
Subject: Re: [patch V3 4/9] MDS basics 4
[-- Attachment #2: Type: text/plain, Size: 653 bytes --]
On 2/21/19 6:44 PM, speck for Thomas Gleixner wrote:
> +#include <asm/segment.h>
> +
> +/**
> + * mds_clear_cpu_buffers - Mitigation for MDS vulnerability
> + *
> + * This uses the otherwise unused and obsolete VERW instruction in
> + * combination with microcode which triggers a CPU buffer flush when the
> + * instruction is executed.
> + */
> +static inline void mds_clear_cpu_buffers(void)
> +{
> + static const u16 ds = __KERNEL_DS;
Dunno if it's worth documenting that using a specifically valid segment
is faster than a zero selector according to Intel.
Jon.
--
Computer Architect | Sent with my Fedora powered laptop
^ permalink raw reply [flat|nested] 72+ messages in thread
* [MODERATED] Encrypted Message
2019-02-25 15:49 ` Greg KH
@ 2019-02-25 15:52 ` Jon Masters
2019-02-25 16:00 ` [MODERATED] " Greg KH
0 siblings, 1 reply; 72+ messages in thread
From: Jon Masters @ 2019-02-25 15:52 UTC (permalink / raw)
To: speck
[-- Attachment #1: Type: text/rfc822-headers; protected-headers="v1", Size: 115 bytes --]
From: Jon Masters <jcm@redhat.com>
To: speck for Greg KH <speck@linutronix.de>
Subject: Re: [PATCH v6 31/43] MDSv6
[-- Attachment #2: Type: text/plain, Size: 1032 bytes --]
On 2/25/19 10:49 AM, speck for Greg KH wrote:
> On Mon, Feb 25, 2019 at 07:34:11AM -0800, speck for Andi Kleen wrote:
>> However I will probably not be able to write a detailed
>> description for each of the interrupt handlers changed because
>> there are just too many.
>
> Then how do you expect each subsystem / driver author to know if this is
> an acceptable change or not? How do you expect to educate driver
> authors to have them determine if they need to do this on their new
> drivers or not? Are you going to hand-audit each new driver that gets
> added to the kernel for forever?
>
> Without this type of information, this seems like a futile exercise.
Forgive me if I'm being too cautious here, but it seems to make most
sense to have the basic MDS infrastructure in place at unembargo. Unless
it's very clear how the auto stuff can be safe, and the audit
comprehensive, I wonder if that shouldn't just be done after.
Jon.
--
Computer Architect | Sent with my Fedora powered laptop
^ permalink raw reply [flat|nested] 72+ messages in thread
* [MODERATED] Encrypted Message
2019-02-25 16:00 ` [MODERATED] " Greg KH
@ 2019-02-25 16:19 ` Jon Masters
0 siblings, 0 replies; 72+ messages in thread
From: Jon Masters @ 2019-02-25 16:19 UTC (permalink / raw)
To: speck
[-- Attachment #1: Type: text/rfc822-headers; protected-headers="v1", Size: 110 bytes --]
From: Jon Masters <jcm@redhat.com>
To: speck for Greg KH <speck@linutronix.de>
Subject: Re: Encrypted Message
[-- Attachment #2: Type: text/plain, Size: 1592 bytes --]
On 2/25/19 11:00 AM, speck for Greg KH wrote:
> On Mon, Feb 25, 2019 at 10:52:30AM -0500, speck for Jon Masters wrote:
>> From: Jon Masters <jcm@redhat.com>
>> To: speck for Greg KH <speck@linutronix.de>
>> Subject: Re: [PATCH v6 31/43] MDSv6
>
>> On 2/25/19 10:49 AM, speck for Greg KH wrote:
>>> On Mon, Feb 25, 2019 at 07:34:11AM -0800, speck for Andi Kleen wrote:
>>
>>
>>>> However I will probably not be able to write a detailed
>>>> description for each of the interrupt handlers changed because
>>>> there are just too many.
>>>
>>> Then how do you expect each subsystem / driver author to know if this is
>>> an acceptable change or not? How do you expect to educate driver
>>> authors to have them determine if they need to do this on their new
>>> drivers or not? Are you going to hand-audit each new driver that gets
>>> added to the kernel for forever?
>>>
>>> Without this type of information, this seems like a futile exercise.
>>
>> Forgive me if I'm being too cautious here, but it seems to make most
>> sense to have the basic MDS infrastructure in place at unembargo. Unless
>> it's very clear how the auto stuff can be safe, and the audit
>> comprehensive, I wonder if that shouldn't just be done after.
>
> I thought that was what Thomas's patchset provided and is what was
> alluded to in patch 00/43 of this series.
Indeed. I'm asking whether we're trying to figure out the "auto" stuff
as well before unembargo or is the other discussion just for planning?
Jon.
--
Computer Architect | Sent with my Fedora powered laptop
^ permalink raw reply [flat|nested] 72+ messages in thread
* [MODERATED] Encrypted Message
2019-02-25 16:30 ` [MODERATED] " Greg KH
@ 2019-02-25 16:41 ` Jon Masters
0 siblings, 0 replies; 72+ messages in thread
From: Jon Masters @ 2019-02-25 16:41 UTC (permalink / raw)
To: speck
[-- Attachment #1: Type: text/rfc822-headers; protected-headers="v1", Size: 115 bytes --]
From: Jon Masters <jcm@redhat.com>
To: speck for Greg KH <speck@linutronix.de>
Subject: Re: [PATCH v6 10/43] MDSv6
[-- Attachment #2: Type: text/plain, Size: 411 bytes --]
On 2/25/19 11:30 AM, speck for Greg KH wrote:
>> +BPF could attack the rest of the kernel if it can successfully
>> +measure side channel side effects.
>
> Can it do such a measurement?
The researchers involved in MDS are actively working on an exploit using
BPF as well, so I expect we'll know soon. My assumption is "yes".
Jon.
--
Computer Architect | Sent with my Fedora powered laptop
^ permalink raw reply [flat|nested] 72+ messages in thread
* [MODERATED] Encrypted Message
2019-02-26 14:19 ` [MODERATED] " Josh Poimboeuf
@ 2019-03-01 20:58 ` Jon Masters
2019-03-01 22:14 ` Jon Masters
0 siblings, 1 reply; 72+ messages in thread
From: Jon Masters @ 2019-03-01 20:58 UTC (permalink / raw)
To: speck
[-- Attachment #1: Type: text/rfc822-headers; protected-headers="v1", Size: 164 bytes --]
From: Jon Masters <jcm@redhat.com>
To: speck for Josh Poimboeuf <speck@linutronix.de>
Subject: Re: [patch V4 04/11] x86/speculation/mds: Add mds_clear_cpu_buffer()
[-- Attachment #2: Type: text/plain, Size: 2764 bytes --]
On 2/26/19 9:19 AM, speck for Josh Poimboeuf wrote:
> On Fri, Feb 22, 2019 at 11:24:22PM +0100, speck for Thomas Gleixner wrote:
>> +MFBDS leaks Fill Buffer Entries. Fill buffers are used internally to manage
>> +L1 miss situations and to hold data which is returned or sent in response
>> +to a memory or I/O operation. Fill buffers can forward data to a load
>> +operation and also write data to the cache. When the fill buffer is
>> +deallocated it can retain the stale data of the preceding operations which
>> +can then be forwarded to a faulting or assisting load operation, which can
>> +be exploited under certain conditions. Fill buffers are shared between
>> +Hyper-Threads so cross thread leakage is possible.
The fill buffers sit opposite the L1D$ and participate in coherency
directly. They supply data directly to the load store units. Here's the
internal summary I wrote (feel free to use any of it that is useful):
"Intel processors utilize fill buffers to perform loads of data when a
miss occurs in the Level 1 data cache. The fill buffer allows the
processor to implement a non-blocking cache, continuing with other
operations while the necessary cache data “line” is loaded from a higher
level cache or from memory. It also allows the result of the fill to be
forwarded directly to the EU (Execution Unit) requiring the load,
without waiting for it to be written into the L1 Data Cache.
A load operation is not decoupled in the same way that a store is, but
it does involve an AGU (Address Generation Unit) operation. If the AGU
generates a fault (#PF, etc.) or an assist (A/D bits) then the classical
Intel design would block the load and later reissue it. In contemporary
designs, it instead allows subsequent speculation operations to
temporarily see a forwarded data value from the fill buffer slot prior
to the load actually taking place. Thus it is possible to read data that
was recently accessed by another thread, if the fill buffer entry is not
reused.
It is this attack that allows cross-thread SMT leakage and breaks HT
without recourse other than to disable it or to implement core
scheduling in the Linux kernel.
Variants of this include loads that cross cache or page boundaries due
to further optimizations in Intel’s implementation. For example, Intel
incorporate logic to guess at address generation prior to determining
whether it crosses such a boundary (covered in US5335333A) and will
forward this to the TLB/load logic prior to resolving the full address.
They will retry the load by re-issuing uops in the case of a cross
cacheline/page boundary but in that case will leak state as well."
Jon.
--
Computer Architect | Sent with my Fedora powered laptop
^ permalink raw reply [flat|nested] 72+ messages in thread
* [MODERATED] Encrypted Message
2019-03-01 20:58 ` [MODERATED] Encrypted Message Jon Masters
@ 2019-03-01 22:14 ` Jon Masters
0 siblings, 0 replies; 72+ messages in thread
From: Jon Masters @ 2019-03-01 22:14 UTC (permalink / raw)
To: speck
[-- Attachment #1: Type: text/rfc822-headers; protected-headers="v1", Size: 161 bytes --]
From: Jon Masters <jcm@redhat.com>
To: speck for Jon Masters <speck@linutronix.de>
Subject: Re: [patch V4 04/11] x86/speculation/mds: Add mds_clear_cpu_buffer()
[-- Attachment #2: Type: text/plain, Size: 3426 bytes --]
On 3/1/19 3:58 PM, speck for Jon Masters wrote:
> On 2/26/19 9:19 AM, speck for Josh Poimboeuf wrote:
>
>> On Fri, Feb 22, 2019 at 11:24:22PM +0100, speck for Thomas Gleixner wrote:
>>> +MFBDS leaks Fill Buffer Entries. Fill buffers are used internally to manage
>>> +L1 miss situations and to hold data which is returned or sent in response
>>> +to a memory or I/O operation. Fill buffers can forward data to a load
>>> +operation and also write data to the cache. When the fill buffer is
>>> +deallocated it can retain the stale data of the preceding operations which
>>> +can then be forwarded to a faulting or assisting load operation, which can
>>> +be exploited under certain conditions. Fill buffers are shared between
>>> +Hyper-Threads so cross thread leakage is possible.
>
> The fill buffers sit opposite the L1D$ and participate in coherency
> directly. They supply data directly to the load store units. Here's the
> internal summary I wrote (feel free to use any of it that is useful):
>
> "Intel processors utilize fill buffers to perform loads of data when a
> miss occurs in the Level 1 data cache. The fill buffer allows the
> processor to implement a non-blocking cache, continuing with other
> operations while the necessary cache data “line” is loaded from a higher
> level cache or from memory. It also allows the result of the fill to be
> forwarded directly to the EU (Execution Unit) requiring the load,
> without waiting for it to be written into the L1 Data Cache.
>
> A load operation is not decoupled in the same way that a store is, but
> it does involve an AGU (Address Generation Unit) operation. If the AGU
> generates a fault (#PF, etc.) or an assist (A/D bits) then the classical
> Intel design would block the load and later reissue it. In contemporary
> designs, it instead allows subsequent speculation operations to
> temporarily see a forwarded data value from the fill buffer slot prior
> to the load actually taking place. Thus it is possible to read data that
> was recently accessed by another thread, if the fill buffer entry is not
> reused.
>
> It is this attack that allows cross-thread SMT leakage and breaks HT
> without recourse other than to disable it or to implement core
> scheduling in the Linux kernel.
>
> Variants of this include loads that cross cache or page boundaries due
> to further optimizations in Intel’s implementation. For example, Intel
> incorporate logic to guess at address generation prior to determining
> whether it crosses such a boundary (covered in US5335333A) and will
> forward this to the TLB/load logic prior to resolving the full address.
> They will retry the load by re-issuing uops in the case of a cross
> cacheline/page boundary but in that case will leak state as well."
Btw, I've various reproducers here that I'm happy to share if useful
with the right folks. Thomas and Linus should already have my IFU one
for later testing of that; I've also got e.g. an FBBF one. Currently it just
spews whatever it sees from the other threads, but in the next few days
I'll have it cleaned up to send/receive specific messages - then can
just wrap it with a bow so it can print yes/no vulnerable.
Ping if you have a need for a repro (keybase/email) and I'll go through
our process for sharing as appropriate.
Jon.
--
Computer Architect | Sent with my Fedora powered laptop
^ permalink raw reply [flat|nested] 72+ messages in thread
* [MODERATED] Encrypted Message
2019-03-04 1:23 ` [MODERATED] [PATCH RFC 1/4] 1 Josh Poimboeuf
@ 2019-03-04 3:55 ` Jon Masters
2019-03-04 7:30 ` [MODERATED] Re: [PATCH RFC 1/4] 1 Greg KH
1 sibling, 0 replies; 72+ messages in thread
From: Jon Masters @ 2019-03-04 3:55 UTC (permalink / raw)
To: speck
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/rfc822-headers; protected-headers="v1", Size: 117 bytes --]
From: Jon Masters <jcm@redhat.com>
To: speck for Josh Poimboeuf <speck@linutronix.de>
Subject: Re: [PATCH RFC 1/4] 1
[-- Attachment #2: Type: text/plain, Size: 1069 bytes --]
On 3/3/19 8:23 PM, speck for Josh Poimboeuf wrote:
> diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
> index e11654f93e71..0c71ab0d57e3 100644
> --- a/arch/x86/kernel/cpu/bugs.c
> +++ b/arch/x86/kernel/cpu/bugs.c
> @@ -221,6 +221,7 @@ static void x86_amd_ssb_disable(void)
>
> /* Default mitigation for L1TF-affected CPUs */
> static enum mds_mitigations mds_mitigation __ro_after_init = MDS_MITIGATION_FULL;
> +static bool mds_nosmt __ro_after_init = false;
>
> static const char * const mds_strings[] = {
> [MDS_MITIGATION_OFF] = "Vulnerable",
> @@ -238,8 +239,13 @@ static void mds_select_mitigation(void)
> if (mds_mitigation == MDS_MITIGATION_FULL) {
> if (!boot_cpu_has(X86_FEATURE_MD_CLEAR))
> mds_mitigation = MDS_MITIGATION_VMWERV;
> +
> static_branch_enable(&mds_user_clear);
> +
> + if (mds_nosmt && !boot_cpu_has(X86_BUG_MSBDS_ONLY))
> + cpu_smt_disable(false);
Is there some logic missing here to disable SMT?
Jon.
--
Computer Architect | Sent with my Fedora powered laptop
^ permalink raw reply [flat|nested] 72+ messages in thread
* [MODERATED] Encrypted Message
2019-03-04 1:24 ` [MODERATED] [PATCH RFC 3/4] 3 Josh Poimboeuf
@ 2019-03-04 3:58 ` Jon Masters
2019-03-04 17:17 ` [MODERATED] " Josh Poimboeuf
0 siblings, 1 reply; 72+ messages in thread
From: Jon Masters @ 2019-03-04 3:58 UTC (permalink / raw)
To: speck
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/rfc822-headers; protected-headers="v1", Size: 117 bytes --]
From: Jon Masters <jcm@redhat.com>
To: speck for Josh Poimboeuf <speck@linutronix.de>
Subject: Re: [PATCH RFC 3/4] 3
[-- Attachment #2: Type: text/plain, Size: 445 bytes --]
On 3/3/19 8:24 PM, speck for Josh Poimboeuf wrote:
> + if (sched_smt_active() && !boot_cpu_has(X86_BUG_MSBDS_ONLY))
> + pr_warn_once(MDS_MSG_SMT);
It's never fully safe to use SMT. I get that if we only had MSBDS then
it's unlikely we'll hit e.g. the power state change cases needed to
exploit it, but I think it would be prudent to display something anyway?
Jon.
--
Computer Architect | Sent with my Fedora powered laptop
^ permalink raw reply [flat|nested] 72+ messages in thread
* [MODERATED] Encrypted Message
2019-03-04 1:25 ` [MODERATED] [PATCH RFC 4/4] 4 Josh Poimboeuf
@ 2019-03-04 4:07 ` Jon Masters
0 siblings, 0 replies; 72+ messages in thread
From: Jon Masters @ 2019-03-04 4:07 UTC (permalink / raw)
To: speck
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/rfc822-headers; protected-headers="v1", Size: 117 bytes --]
From: Jon Masters <jcm@redhat.com>
To: speck for Josh Poimboeuf <speck@linutronix.de>
Subject: Re: [PATCH RFC 4/4] 4
[-- Attachment #2: Type: text/plain, Size: 1461 bytes --]
On 3/3/19 8:25 PM, speck for Josh Poimboeuf wrote:
> From: Josh Poimboeuf <jpoimboe@redhat.com>
> Subject: [PATCH RFC 4/4] x86/speculation: Add 'cpu_spec_mitigations=' cmdline
> options
>
> Keeping track of the number of mitigations for all the CPU speculation
> bugs has become overwhelming for many users. It's getting more and more
> complicated to decide what mitigations are needed for a given
> architecture.
>
> Most users fall into a few basic categories:
>
> - want all mitigations off;
>
> - want all reasonable mitigations on, with SMT enabled even if it's
> vulnerable; or
>
> - want all reasonable mitigations on, with SMT disabled if vulnerable.
>
> Define a set of curated, arch-independent options, each of which is an
> aggregation of existing options:
>
> - cpu_spec_mitigations=off: Disable all mitigations.
>
> - cpu_spec_mitigations=auto: [default] Enable all the default mitigations,
> but leave SMT enabled, even if it's vulnerable.
>
> - cpu_spec_mitigations=auto,nosmt: Enable all the default mitigations,
> disabling SMT if needed by a mitigation.
>
> See the documentation for more details.
Looks good. There's an effort to upstream mitigation controls for
arm64, but that's not in place yet; they'll want to wire that up later. I
had actually missed the s390x etokens work, so that was fun to see here.
Jon.
--
Computer Architect | Sent with my Fedora powered laptop
^ permalink raw reply [flat|nested] 72+ messages in thread
* [MODERATED] Encrypted Message
2019-03-01 21:47 [patch V6 00/14] MDS basics 0 Thomas Gleixner
` (3 preceding siblings ...)
2019-03-01 21:47 ` [patch V6 12/14] MDS basics 12 Thomas Gleixner
@ 2019-03-04 5:30 ` Jon Masters
4 siblings, 0 replies; 72+ messages in thread
From: Jon Masters @ 2019-03-04 5:30 UTC (permalink / raw)
To: speck
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/rfc822-headers; protected-headers="v1", Size: 130 bytes --]
From: Jon Masters <jcm@redhat.com>
To: speck for Thomas Gleixner <speck@linutronix.de>
Subject: Re: [patch V6 00/14] MDS basics 0
[-- Attachment #2: Type: text/plain, Size: 1408 bytes --]
On 3/1/19 4:47 PM, speck for Thomas Gleixner wrote:
> Changes vs. V5:
>
> - Fix tools/ build (Josh)
>
> - Dropped the AIRMONT_MID change as it needs confirmation from Intel
>
> - Made the consolidated whitelist more readable and correct
>
> - Added the MSBDS only quirk for XEON PHI, made the idle flush
> depend on it and updated the sysfs output accordingly.
>
> - Fixed the protection matrix in the admin documentation and clarified
> the SMT situation vs. MSBDS only.
>
> - Updated the KVM/VMX changelog.
>
> Delta patch against V5 below.
>
> Available from git:
>
> cvs.ou.linutronix.de:linux/speck/linux WIP.mds
>
> The linux-4.20.y, linux-4.19.y and linux-4.14.y branches are updated as
> well and contain the untested backports of the pile for reference.
>
> I'll send git bundles of the pile as well.
Tested on Coffeelake with updated ucode successfully:
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 158
model name : Intel(R) Core(TM) i7-8086K CPU @ 4.00GHz
stepping : 10
microcode : 0xae
[jcm@stephen ~]$ dmesg|grep MDS
[ 1.633165] MDS: Mitigation: Clear CPU buffers
[jcm@stephen ~]$ cat /sys/devices/system/cpu/vulnerabilities/mds
Mitigation: Clear CPU buffers; SMT vulnerable
Jon.
--
Computer Architect | Sent with my Fedora powered laptop
^ permalink raw reply [flat|nested] 72+ messages in thread
* [MODERATED] Encrypted Message
2019-03-01 21:47 ` [patch V6 12/14] MDS basics 12 Thomas Gleixner
@ 2019-03-04 5:47 ` Jon Masters
0 siblings, 0 replies; 72+ messages in thread
From: Jon Masters @ 2019-03-04 5:47 UTC (permalink / raw)
To: speck
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/rfc822-headers; protected-headers="v1", Size: 131 bytes --]
From: Jon Masters <jcm@redhat.com>
To: speck for Thomas Gleixner <speck@linutronix.de>
Subject: Re: [patch V6 12/14] MDS basics 12
[-- Attachment #2: Type: text/plain, Size: 1553 bytes --]
On 3/1/19 4:47 PM, speck for Thomas Gleixner wrote:
> Subject: [patch V6 12/14] x86/speculation/mds: Add mitigation mode VMWERV
> From: Thomas Gleixner <tglx@linutronix.de>
>
> In virtualized environments it can happen that the host has the microcode
> update which utilizes the VERW instruction to clear CPU buffers, but the
> hypervisor is not yet updated to expose the X86_FEATURE_MD_CLEAR CPUID bit
> to guests.
>
> Introduce an internal mitigation mode VMWERV which enables the invocation
> of the CPU buffer clearing even if X86_FEATURE_MD_CLEAR is not set. If the
> system has no updated microcode this results in a pointless execution of
> the VERW instruction wasting a few CPU cycles. If the microcode is updated,
> but not exposed to a guest then the CPU buffers will be cleared.
>
> That said: Virtual Machines Will Eventually Receive Vaccine
The effect of this patch, currently, is that a (bare metal) machine
without updated ucode will print the following:
[ 1.576602] MDS: Vulnerable: Clear CPU buffers attempted, no microcode
The intention of the patch is to say "hey, you might be on a VM, so
we'll try anyway in case we didn't get told you had MD_CLEAR". But the
effect on bare metal might be ambiguous. It's reasonable (for someone
else) to assume we might be using a software sequence to try flushing.
Perhaps the wording should convey something like:
"MDS: Vulnerable: Clear CPU buffers may not work, no microcode"
Jon.
--
Computer Architect | Sent with my Fedora powered laptop
^ permalink raw reply [flat|nested] 72+ messages in thread
* [MODERATED] Encrypted Message
2019-03-01 21:47 ` [patch V6 06/14] MDS basics 6 Thomas Gleixner
@ 2019-03-04 6:28 ` Jon Masters
0 siblings, 0 replies; 72+ messages in thread
From: Jon Masters @ 2019-03-04 6:28 UTC (permalink / raw)
To: speck
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/rfc822-headers; protected-headers="v1", Size: 130 bytes --]
From: Jon Masters <jcm@redhat.com>
To: speck for Thomas Gleixner <speck@linutronix.de>
Subject: Re: [patch V6 06/14] MDS basics 6
[-- Attachment #2: Type: text/plain, Size: 1195 bytes --]
On 3/1/19 4:47 PM, speck for Thomas Gleixner wrote:
> Provide an inline function with the assembly magic. The argument of the VERW
> instruction must be a memory operand as documented:
>
> "MD_CLEAR enumerates that the memory-operand variant of VERW (for
> example, VERW m16) has been extended to also overwrite buffers affected
> by MDS. This buffer overwriting functionality is not guaranteed for the
> register operand variant of VERW."
>
> Documentation also recommends to use a writable data segment selector:
>
> "The buffer overwriting occurs regardless of the result of the VERW
> permission check, as well as when the selector is null or causes a
> descriptor load segment violation. However, for lowest latency we
> recommend using a selector that indicates a valid writable data
> segment."
Note that we raised this again with Intel last week amid Andrew's
results, and they are going to get back to us if this guidance changes as
a result of further measurements on their end. It's a few cycles'
difference in the Coffeelake case, but it could always be higher.
Jon.
--
Computer Architect | Sent with my Fedora powered laptop
^ permalink raw reply [flat|nested] 72+ messages in thread
* [MODERATED] Encrypted Message
2019-03-01 21:47 ` [patch V6 10/14] MDS basics 10 Thomas Gleixner
@ 2019-03-04 6:45 ` Jon Masters
0 siblings, 0 replies; 72+ messages in thread
From: Jon Masters @ 2019-03-04 6:45 UTC (permalink / raw)
To: speck
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/rfc822-headers; protected-headers="v1", Size: 131 bytes --]
From: Jon Masters <jcm@redhat.com>
To: speck for Thomas Gleixner <speck@linutronix.de>
Subject: Re: [patch V6 10/14] MDS basics 10
[-- Attachment #2: Type: text/plain, Size: 306 bytes --]
On 3/1/19 4:47 PM, speck for Thomas Gleixner wrote:
> + /*
> + * Enable the idle clearing on CPUs which are affected only by
> + * MDBDS and not any other MDS variant. The other variants cannot
^^^^^
MSBDS
--
Computer Architect | Sent with my Fedora powered laptop
^ permalink raw reply [flat|nested] 72+ messages in thread
* [MODERATED] Encrypted Message
2019-03-01 21:47 ` [patch V6 08/14] MDS basics 8 Thomas Gleixner
@ 2019-03-04 6:57 ` Jon Masters
2019-03-04 7:06 ` Jon Masters
2019-03-05 15:34 ` Thomas Gleixner
0 siblings, 2 replies; 72+ messages in thread
From: Jon Masters @ 2019-03-04 6:57 UTC (permalink / raw)
To: speck
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/rfc822-headers; protected-headers="v1", Size: 130 bytes --]
From: Jon Masters <jcm@redhat.com>
To: speck for Thomas Gleixner <speck@linutronix.de>
Subject: Re: [patch V6 08/14] MDS basics 8
[-- Attachment #2: Type: text/plain, Size: 491 bytes --]
On 3/1/19 4:47 PM, speck for Thomas Gleixner wrote:
> if (static_branch_unlikely(&vmx_l1d_should_flush))
> vmx_l1d_flush(vcpu);
> + else if (static_branch_unlikely(&mds_user_clear))
> + mds_clear_cpu_buffers();
Does this cover the case where we have older ucode installed that does
the L1D flush but NOT MD_CLEAR? I'm about to go check whether there's
logic handling this, but wanted to call it out.
Jon.
--
Computer Architect | Sent with my Fedora powered laptop
^ permalink raw reply [flat|nested] 72+ messages in thread
* [MODERATED] Encrypted Message
2019-03-04 6:57 ` [MODERATED] Encrypted Message Jon Masters
@ 2019-03-04 7:06 ` Jon Masters
2019-03-04 8:12 ` Jon Masters
2019-03-05 15:34 ` Thomas Gleixner
1 sibling, 1 reply; 72+ messages in thread
From: Jon Masters @ 2019-03-04 7:06 UTC (permalink / raw)
To: speck
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/rfc822-headers; protected-headers="v1", Size: 126 bytes --]
From: Jon Masters <jcm@redhat.com>
To: speck for Jon Masters <speck@linutronix.de>
Subject: Re: [patch V6 08/14] MDS basics 8
[-- Attachment #2: Type: text/plain, Size: 877 bytes --]
On 3/4/19 1:57 AM, speck for Jon Masters wrote:
> On 3/1/19 4:47 PM, speck for Thomas Gleixner wrote:
>> if (static_branch_unlikely(&vmx_l1d_should_flush))
>> vmx_l1d_flush(vcpu);
>> + else if (static_branch_unlikely(&mds_user_clear))
>> + mds_clear_cpu_buffers();
>
> Does this cover the case where we have older ucode installed that does
> L1D flush but NOT the MD_CLEAR? I'm about to go check to see if there's
> logic handling this but wanted to call it out.
Aside from the above question, I've reviewed all of the patches
extensively at this point. Feel free to add a Reviewed-by or Tested-by
according to your preference. I've a bunch of further tests running,
including on AMD platforms just to check nothing broke with those
platforms that are not susceptible to MDS.
Jon.
--
Computer Architect | Sent with my Fedora powered laptop
^ permalink raw reply [flat|nested] 72+ messages in thread
* [MODERATED] Encrypted Message
2019-03-04 7:30 ` [MODERATED] Re: [PATCH RFC 1/4] 1 Greg KH
@ 2019-03-04 7:45 ` Jon Masters
0 siblings, 0 replies; 72+ messages in thread
From: Jon Masters @ 2019-03-04 7:45 UTC (permalink / raw)
To: speck
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/rfc822-headers; protected-headers="v1", Size: 110 bytes --]
From: Jon Masters <jcm@redhat.com>
To: speck for Greg KH <speck@linutronix.de>
Subject: Re: [PATCH RFC 1/4] 1
[-- Attachment #2: Type: text/plain, Size: 1867 bytes --]
On 3/4/19 2:30 AM, speck for Greg KH wrote:
> On Sun, Mar 03, 2019 at 07:23:22PM -0600, speck for Josh Poimboeuf wrote:
>> From: Josh Poimboeuf <jpoimboe@redhat.com>
>> Subject: [PATCH RFC 1/4] x86/speculation/mds: Add mds=full,nosmt cmdline
>> option
>>
>> Add the mds=full,nosmt cmdline option. This is like mds=full, but with
>> SMT disabled if the CPU is vulnerable.
>>
>> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
>> ---
>> Documentation/admin-guide/hw-vuln/mds.rst | 3 +++
>> Documentation/admin-guide/kernel-parameters.txt | 6 ++++--
>> arch/x86/kernel/cpu/bugs.c | 10 ++++++++++
>> 3 files changed, 17 insertions(+), 2 deletions(-)
>>
>> diff --git a/Documentation/admin-guide/hw-vuln/mds.rst b/Documentation/admin-guide/hw-vuln/mds.rst
>> index 1de29d28903d..244ab47d1fb3 100644
>> --- a/Documentation/admin-guide/hw-vuln/mds.rst
>> +++ b/Documentation/admin-guide/hw-vuln/mds.rst
>> @@ -260,6 +260,9 @@ time with the option "mds=". The valid arguments for this option are:
>>
>> It does not automatically disable SMT.
>>
>> + full,nosmt The same as mds=full, with SMT disabled on vulnerable
>> + CPUs. This is the complete mitigation.
>
> While I understand the intention, the number of different combinations
> we are "offering" to userspace here is huge, and everyone is going to be
> confused as to what to do. If we really think/say that SMT is a major
> issue for this, why don't we just have "full" disable SMT?
Frankly, it ought to, for safety (SMT can't be made safe). The reason
cited for not doing so (Thomas and Linus can speak up on this part) was
upgrades vs. new installs: the concern was not to break existing folks by
losing half their logical CPU count when upgrading a kernel.
Jon.
--
Computer Architect | Sent with my Fedora powered laptop
^ permalink raw reply [flat|nested] 72+ messages in thread
* [MODERATED] Encrypted Message
2019-03-04 7:06 ` Jon Masters
@ 2019-03-04 8:12 ` Jon Masters
0 siblings, 0 replies; 72+ messages in thread
From: Jon Masters @ 2019-03-04 8:12 UTC (permalink / raw)
To: speck
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/rfc822-headers; protected-headers="v1", Size: 126 bytes --]
From: Jon Masters <jcm@redhat.com>
To: speck for Jon Masters <speck@linutronix.de>
Subject: Re: [patch V6 08/14] MDS basics 8
[-- Attachment #2: Type: text/plain, Size: 1075 bytes --]
On 3/4/19 2:06 AM, speck for Jon Masters wrote:
> On 3/4/19 1:57 AM, speck for Jon Masters wrote:
>> On 3/1/19 4:47 PM, speck for Thomas Gleixner wrote:
>>> if (static_branch_unlikely(&vmx_l1d_should_flush))
>>> vmx_l1d_flush(vcpu);
>>> + else if (static_branch_unlikely(&mds_user_clear))
>>> + mds_clear_cpu_buffers();
>>
>> Does this cover the case where we have older ucode installed that does
>> L1D flush but NOT the MD_CLEAR? I'm about to go check to see if there's
>> logic handling this but wanted to call it out.
>
> Aside from the above question, I've reviewed all of the patches
> extensively at this point. Feel free to add a Reviewed-by or Tested-by
> according to your preference. I've a bunch of further tests running,
> including on AMD platforms just to check nothing broke with those
> platforms that are not susceptible to MDS.
Running fine on AMD platform here and reports correctly:
$ cat /sys/devices/system/cpu/vulnerabilities/mds
Not affected
Jon.
--
Computer Architect | Sent with my Fedora powered laptop
^ permalink raw reply [flat|nested] 72+ messages in thread
* [MODERATED] Encrypted Message
2019-03-05 16:43 [MODERATED] Starting to go public? Linus Torvalds
2019-03-05 17:02 ` [MODERATED] " Andrew Cooper
@ 2019-03-05 17:10 ` Jon Masters
1 sibling, 0 replies; 72+ messages in thread
From: Jon Masters @ 2019-03-05 17:10 UTC (permalink / raw)
To: speck
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/rfc822-headers; protected-headers="v1", Size: 135 bytes --]
From: Jon Masters <jcm@redhat.com>
To: speck for Linus Torvalds <speck@linutronix.de>
Subject: NOT PUBLIC - Re: Starting to go public?
[-- Attachment #2: Type: text/plain, Size: 796 bytes --]
On 3/5/19 11:43 AM, speck for Linus Torvalds wrote:
> Looks like the papers are starting to leak:
>
> https://arxiv.org/pdf/1903.00446.pdf
>
> yes, yes, a lot of the attack seems to be about rowhammer, but the
> "spoiler" part looks like MDS.
It's not MDS, but it is close to finding PSF behavior. The thing they
found is described separately in one of the original Intel store patents,
so we are at risk but should not panic.
I've spoken with several researchers sitting on MDS papers and confirmed
that they are NOT concerned at this stage. Of course everyone is
carefully watching, and that's why we need to have contingency. People
will start looking in this area now (I know of three teams already doing so).
Jon.
--
Computer Architect | Sent with my Fedora powered laptop
^ permalink raw reply [flat|nested] 72+ messages in thread
* [MODERATED] Encrypted Message
2019-03-05 22:31 ` Andrew Cooper
@ 2019-03-06 16:18 ` Jon Masters
0 siblings, 0 replies; 72+ messages in thread
From: Jon Masters @ 2019-03-06 16:18 UTC (permalink / raw)
To: speck
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/rfc822-headers; protected-headers="v1", Size: 121 bytes --]
From: Jon Masters <jcm@redhat.com>
To: speck for Andrew Cooper <speck@linutronix.de>
Subject: Re: Starting to go public?
[-- Attachment #2: Type: text/plain, Size: 1380 bytes --]
On 3/5/19 5:31 PM, speck for Andrew Cooper wrote:
> On 05/03/2019 20:36, speck for Jiri Kosina wrote:
>> On Tue, 5 Mar 2019, speck for Andrew Cooper wrote:
>>
>>>> Looks like the papers are starting to leak:
>>>>
>>>> https://arxiv.org/pdf/1903.00446.pdf
>>>>
>>>> yes, yes, a lot of the attack seems to be about rowhammer, but the
>>>> "spoiler" part looks like MDS.
>>> So Intel was aware of that paper, but wasn't expecting it to go public
>>> today.
>>>
>>> From their point of view, it is a traditional timing sidechannel on a
>>> piece of the pipeline (which happens to be component which exists for
>>> speculative memory disambiguation).
>>>
>>> There are no proposed changes to the MDS timeline at this point.
>> So this is not the paper that caused the panic fearing that PSF might leak
>> earlier than the rest of the issues in mid-february (which few days later
>> Intel claimed to have succesfully negotiated with the researches not to
>> publish before the CRD)?
>
> Correct.
>
> The incident you are referring to is a researcher who definitely found
> PSF, contacted Intel and was initially displeased at the proposed embargo.
Indeed. There are at least three different teams with papers that read
on MDS, and all of them are holding to the embargo.
Jon.
--
Computer Architect | Sent with my Fedora powered laptop
^ permalink raw reply [flat|nested] 72+ messages in thread
* [MODERATED] Encrypted Message
2019-03-05 15:34 ` Thomas Gleixner
@ 2019-03-06 16:21 ` Jon Masters
0 siblings, 0 replies; 72+ messages in thread
From: Jon Masters @ 2019-03-06 16:21 UTC (permalink / raw)
To: speck
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/rfc822-headers; protected-headers="v1", Size: 118 bytes --]
From: Jon Masters <jcm@redhat.com>
To: speck for Thomas Gleixner <speck@linutronix.de>
Subject: Re: Encrypted Message
[-- Attachment #2: Type: text/plain, Size: 990 bytes --]
On 3/5/19 10:34 AM, speck for Thomas Gleixner wrote:
> On Mon, 4 Mar 2019, speck for Jon Masters wrote:
>
>> On 3/1/19 4:47 PM, speck for Thomas Gleixner wrote:
>>> if (static_branch_unlikely(&vmx_l1d_should_flush))
>>> vmx_l1d_flush(vcpu);
>>> + else if (static_branch_unlikely(&mds_user_clear))
>>> + mds_clear_cpu_buffers();
>>
>> Does this cover the case where we have older ucode installed that does
>> L1D flush but NOT the MD_CLEAR? I'm about to go check to see if there's
>> logic handling this but wanted to call it out.
>
> If no updated microcode is available then it's pretty irrelevant which code
> path you take. None of them will mitigate MDS.
You're right. My fear was that we'd have some microcode that mitigated
L1D without the implied MD clear but was still affected by MDS. I was
incorrect - all ucode that will be publicly released will have both properties.
Jon.
--
Computer Architect | Sent with my Fedora powered laptop
^ permalink raw reply [flat|nested] 72+ messages in thread
* [MODERATED] Encrypted Message
2019-03-04 17:17 ` [MODERATED] " Josh Poimboeuf
@ 2019-03-06 16:22 ` Jon Masters
0 siblings, 0 replies; 72+ messages in thread
From: Jon Masters @ 2019-03-06 16:22 UTC (permalink / raw)
To: speck
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/rfc822-headers; protected-headers="v1", Size: 117 bytes --]
From: Jon Masters <jcm@redhat.com>
To: speck for Josh Poimboeuf <speck@linutronix.de>
Subject: Re: Encrypted Message
[-- Attachment #2: Type: text/plain, Size: 778 bytes --]
On 3/4/19 12:17 PM, speck for Josh Poimboeuf wrote:
> On Sun, Mar 03, 2019 at 10:58:01PM -0500, speck for Jon Masters wrote:
>
>> On 3/3/19 8:24 PM, speck for Josh Poimboeuf wrote:
>>
>>> + if (sched_smt_active() && !boot_cpu_has(X86_BUG_MSBDS_ONLY))
>>> + pr_warn_once(MDS_MSG_SMT);
>>
>> It's never fully safe to use SMT. I get that if we only had MSBDS then
>> it's unlikely we'll hit e.g. the power state change cases needed to
>> exploit it, but I think it would be prudent to display something anyway?
>
> My understanding is that the idle state changes are mitigated elsewhere
> in the MDS patches, so it should be safe in theory.
Looked at it again. Agree. Sorry about that.
Jon.
--
Computer Architect | Sent with my Fedora powered laptop
^ permalink raw reply [flat|nested] 72+ messages in thread
end of thread, other threads:[~2019-03-06 16:22 UTC | newest]
Thread overview: 72+ messages
2018-05-29 19:42 [MODERATED] [PATCH 0/2] L1TF KVM 0 Paolo Bonzini
2018-05-29 19:42 ` [MODERATED] [PATCH 1/2] L1TF KVM 1 Paolo Bonzini
2018-05-29 19:42 ` [MODERATED] [PATCH 2/2] L1TF KVM 2 Paolo Bonzini
[not found] ` <20180529194240.7F1336110A@crypto-ml.lab.linutronix.de>
2018-05-29 22:49 ` [PATCH 1/2] L1TF KVM 1 Thomas Gleixner
2018-05-29 23:54 ` [MODERATED] " Andrew Cooper
2018-05-30 9:01 ` Paolo Bonzini
2018-05-30 11:58 ` Martin Pohlack
2018-05-30 12:25 ` Thomas Gleixner
2018-05-30 14:31 ` Thomas Gleixner
2018-06-04 8:24 ` [MODERATED] " Martin Pohlack
2018-06-04 13:11 ` [MODERATED] Is: Tim, Q to you. Was:Re: " Konrad Rzeszutek Wilk
2018-06-04 17:59 ` [MODERATED] Encrypted Message Tim Chen
2018-06-05 1:25 ` [MODERATED] Re: Is: Tim, Q to you. Was:Re: [PATCH 1/2] L1TF KVM 1 Jon Masters
2018-06-05 1:30 ` Linus Torvalds
2018-06-05 7:10 ` Martin Pohlack
2018-06-05 23:34 ` [MODERATED] Encrypted Message Tim Chen
2018-06-05 23:37 ` Tim Chen
2018-06-07 19:11 ` Tim Chen
2018-06-07 23:24 ` [MODERATED] Re: Is: Tim, Q to you. Was:Re: [PATCH 1/2] L1TF KVM 1 Andi Kleen
2018-06-08 16:29 ` Thomas Gleixner
2018-06-08 17:51 ` [MODERATED] " Josh Poimboeuf
2018-06-11 14:50 ` Paolo Bonzini
2018-05-30 8:55 ` [MODERATED] " Peter Zijlstra
2018-05-30 9:02 ` Paolo Bonzini
2018-05-31 19:00 ` Jon Masters
[not found] ` <20180529194322.8B56F610F8@crypto-ml.lab.linutronix.de>
2018-05-29 23:59 ` [MODERATED] Re: [PATCH 2/2] L1TF KVM 2 Andrew Cooper
2018-05-30 8:38 ` Thomas Gleixner
2018-05-30 16:57 ` [MODERATED] " Andrew Cooper
2018-05-30 19:11 ` Thomas Gleixner
2018-05-30 21:10 ` [MODERATED] " Andi Kleen
2018-05-30 23:19 ` Andrew Cooper
[not found] ` <20180529194239.768D561107@crypto-ml.lab.linutronix.de>
2018-06-01 16:48 ` [MODERATED] Re: [PATCH 1/2] L1TF KVM 1 Konrad Rzeszutek Wilk
2018-06-04 14:56 ` Paolo Bonzini
[not found] ` <20180529194236.EDB8561100@crypto-ml.lab.linutronix.de>
2018-06-06 0:34 ` Dave Hansen
2018-06-06 14:15 ` Dave Hansen
[not found] ` <20180529194240.5654A61109@crypto-ml.lab.linutronix.de>
2018-06-08 17:49 ` Josh Poimboeuf
2018-06-08 20:49 ` Konrad Rzeszutek Wilk
2018-06-08 22:13 ` Josh Poimboeuf
-- strict thread matches above, loose matches on Subject: below --
2019-03-05 16:43 [MODERATED] Starting to go public? Linus Torvalds
2019-03-05 17:02 ` [MODERATED] " Andrew Cooper
2019-03-05 20:36 ` Jiri Kosina
2019-03-05 22:31 ` Andrew Cooper
2019-03-06 16:18 ` [MODERATED] Encrypted Message Jon Masters
2019-03-05 17:10 ` Jon Masters
2019-03-04 1:21 [MODERATED] [PATCH RFC 0/4] Proposed cmdline improvements Josh Poimboeuf
2019-03-04 1:23 ` [MODERATED] [PATCH RFC 1/4] 1 Josh Poimboeuf
2019-03-04 3:55 ` [MODERATED] Encrypted Message Jon Masters
2019-03-04 7:30 ` [MODERATED] Re: [PATCH RFC 1/4] 1 Greg KH
2019-03-04 7:45 ` [MODERATED] Encrypted Message Jon Masters
2019-03-04 1:24 ` [MODERATED] [PATCH RFC 3/4] 3 Josh Poimboeuf
2019-03-04 3:58 ` [MODERATED] Encrypted Message Jon Masters
2019-03-04 17:17 ` [MODERATED] " Josh Poimboeuf
2019-03-06 16:22 ` [MODERATED] " Jon Masters
2019-03-04 1:25 ` [MODERATED] [PATCH RFC 4/4] 4 Josh Poimboeuf
2019-03-04 4:07 ` [MODERATED] Encrypted Message Jon Masters
2019-03-01 21:47 [patch V6 00/14] MDS basics 0 Thomas Gleixner
2019-03-01 21:47 ` [patch V6 06/14] MDS basics 6 Thomas Gleixner
2019-03-04 6:28 ` [MODERATED] Encrypted Message Jon Masters
2019-03-01 21:47 ` [patch V6 08/14] MDS basics 8 Thomas Gleixner
2019-03-04 6:57 ` [MODERATED] Encrypted Message Jon Masters
2019-03-04 7:06 ` Jon Masters
2019-03-04 8:12 ` Jon Masters
2019-03-05 15:34 ` Thomas Gleixner
2019-03-06 16:21 ` [MODERATED] " Jon Masters
2019-03-01 21:47 ` [patch V6 10/14] MDS basics 10 Thomas Gleixner
2019-03-04 6:45 ` [MODERATED] Encrypted Message Jon Masters
2019-03-01 21:47 ` [patch V6 12/14] MDS basics 12 Thomas Gleixner
2019-03-04 5:47 ` [MODERATED] Encrypted Message Jon Masters
2019-03-04 5:30 ` Jon Masters
2019-02-24 15:07 [MODERATED] [PATCH v6 00/43] MDSv6 Andi Kleen
2019-02-24 15:07 ` [MODERATED] [PATCH v6 10/43] MDSv6 Andi Kleen
2019-02-25 16:30 ` [MODERATED] " Greg KH
2019-02-25 16:41 ` [MODERATED] Encrypted Message Jon Masters
2019-02-24 15:07 ` [MODERATED] [PATCH v6 31/43] MDSv6 Andi Kleen
2019-02-25 15:19 ` [MODERATED] " Greg KH
2019-02-25 15:34 ` Andi Kleen
2019-02-25 15:49 ` Greg KH
2019-02-25 15:52 ` [MODERATED] Encrypted Message Jon Masters
2019-02-25 16:00 ` [MODERATED] " Greg KH
2019-02-25 16:19 ` [MODERATED] " Jon Masters
2019-02-22 22:24 [patch V4 00/11] MDS basics Thomas Gleixner
2019-02-22 22:24 ` [patch V4 04/11] x86/speculation/mds: Add mds_clear_cpu_buffer() Thomas Gleixner
2019-02-26 14:19 ` [MODERATED] " Josh Poimboeuf
2019-03-01 20:58 ` [MODERATED] Encrypted Message Jon Masters
2019-03-01 22:14 ` Jon Masters
2019-02-21 23:44 [patch V3 0/9] MDS basics 0 Thomas Gleixner
2019-02-21 23:44 ` [patch V3 4/9] MDS basics 4 Thomas Gleixner
2019-02-22 7:45 ` [MODERATED] Encrypted Message Jon Masters
2019-02-20 15:07 [patch V2 00/10] MDS basics+ 0 Thomas Gleixner
2019-02-20 15:07 ` [patch V2 04/10] MDS basics+ 4 Thomas Gleixner
2019-02-20 17:10 ` [MODERATED] " mark gross
2019-02-21 19:26 ` [MODERATED] Encrypted Message Tim Chen
2019-02-19 12:44 [patch 0/8] MDS basics 0 Thomas Gleixner
2019-02-21 16:14 ` [MODERATED] Encrypted Message Jon Masters
2019-02-07 23:41 [MODERATED] [PATCH v3 0/6] PERFv3 Andi Kleen
2019-02-07 23:41 ` [MODERATED] [PATCH v3 2/6] PERFv3 Andi Kleen
2019-02-08 0:51 ` [MODERATED] Re: [SUSPECTED SPAM][PATCH " Andrew Cooper
2019-02-08 9:01 ` Peter Zijlstra
2019-02-08 9:39 ` Peter Zijlstra
2019-02-08 10:53 ` [MODERATED] [RFC][PATCH] performance walnuts Peter Zijlstra
2019-02-15 23:45 ` [MODERATED] Encrypted Message Jon Masters
2019-01-12 1:29 [MODERATED] [PATCH v4 00/28] MDSv4 2 Andi Kleen
2019-01-12 1:29 ` [MODERATED] [PATCH v4 05/28] MDSv4 10 Andi Kleen
2019-01-14 19:20 ` [MODERATED] " Dave Hansen
2019-01-18 7:33 ` [MODERATED] Encrypted Message Jon Masters
2019-01-14 23:39 ` Tim Chen
2019-01-12 1:29 ` [MODERATED] [PATCH v4 10/28] MDSv4 24 Andi Kleen
2019-01-15 1:05 ` [MODERATED] Encrypted Message Tim Chen
2018-06-12 17:29 [MODERATED] FYI - Reading uncached memory Jon Masters
2018-06-14 16:59 ` [MODERATED] Encrypted Message Tim Chen
2018-05-17 20:53 SSB status - V18 pushed out Thomas Gleixner
2018-05-18 13:54 ` [MODERATED] Is: Sleep states ?Was:Re: " Konrad Rzeszutek Wilk
2018-05-18 14:29 ` Thomas Gleixner
2018-05-18 19:50 ` [MODERATED] Encrypted Message Tim Chen
2018-05-02 21:51 [patch V11 00/16] SSB 0 Thomas Gleixner
2018-05-03 4:27 ` [MODERATED] Encrypted Message Tim Chen
2018-04-24 9:06 [MODERATED] L1D-Fault KVM mitigation Joerg Roedel
2018-04-24 9:35 ` [MODERATED] " Peter Zijlstra
2018-04-24 9:48 ` David Woodhouse
2018-04-24 11:04 ` Peter Zijlstra
2018-05-23 9:45 ` David Woodhouse
2018-05-24 9:45 ` Peter Zijlstra
2018-05-24 15:04 ` Thomas Gleixner
2018-05-24 15:33 ` Thomas Gleixner
2018-05-24 23:18 ` [MODERATED] Encrypted Message Tim Chen
2018-05-25 18:22 ` Tim Chen
2018-05-26 19:14 ` L1D-Fault KVM mitigation Thomas Gleixner
2018-05-29 19:29 ` [MODERATED] Encrypted Message Tim Chen
2018-05-29 21:14 ` L1D-Fault KVM mitigation Thomas Gleixner
2018-05-30 16:38 ` [MODERATED] Encrypted Message Tim Chen