From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <konrad.wilk@oracle.com>
Received: from mail.linutronix.de (146.0.238.70:993) by
  crypto-ml.lab.linutronix.de with IMAP4-SSL for <speck@linutronix.de>; 21 Jun
  2018 14:05:05 -0000
Received: from aserp2120.oracle.com ([141.146.126.78])
	by Galois.linutronix.de with esmtps (TLS1.2:RSA_AES_256_CBC_SHA256:256)
	(Exim 4.80)
	(envelope-from <konrad.wilk@oracle.com>)
	id 1fW0Ct-0008Tb-Db
	for speck@linutronix.de; Thu, 21 Jun 2018 16:05:04 +0200
Received: from pps.filterd (aserp2120.oracle.com [127.0.0.1])
	by aserp2120.oracle.com (8.16.0.22/8.16.0.22) with SMTP id w5LDwhnN112240
	for <speck@linutronix.de>; Thu, 21 Jun 2018 14:04:56 GMT
Received: from aserv0021.oracle.com (aserv0021.oracle.com [141.146.126.233])
	by aserp2120.oracle.com with ESMTP id 2jmtgx0w9a-1
	(version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK)
	for <speck@linutronix.de>; Thu, 21 Jun 2018 14:04:56 +0000
Received: from userv0122.oracle.com (userv0122.oracle.com [156.151.31.75])
	by aserv0021.oracle.com (8.14.4/8.14.4) with ESMTP id w5LE4tsg006073
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK)
	for <speck@linutronix.de>; Thu, 21 Jun 2018 14:04:56 GMT
Received: from abhmp0001.oracle.com (abhmp0001.oracle.com [141.146.116.7])
	by userv0122.oracle.com (8.14.4/8.14.4) with ESMTP id w5LE4t9F026175
	for <speck@linutronix.de>; Thu, 21 Jun 2018 14:04:55 GMT
Date: Thu, 21 Jun 2018 10:04:54 -0400
From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Subject: [MODERATED] Re: [PATCH v2.1 5/6] [PATCH v2.1 5/6] Patch #5
Message-ID: <20180621140454.GA2293@char.US.ORACLE.com>
References: <20180620204352.066357420@localhost.localdomain>
 <20180620205701.GD28309@char.us.oracle.com>
 <20180621031759.GC9544@char.us.oracle.com>
MIME-Version: 1.0
In-Reply-To: <20180621031759.GC9544@char.us.oracle.com>
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 7bit
To: speck@linutronix.de
List-ID: <speck.linutronix.de>

On Wed, Jun 20, 2018 at 11:17:59PM -0400, speck for Konrad Rzeszutek Wilk wrote:
> On Wed, Jun 20, 2018 at 04:57:02PM -0400, speck for Konrad Rzeszutek Wilk wrote:
> > On Wed, Jun 20, 2018 at 04:43:01PM -0400, speck for konrad.wilk_at_oracle.com wrote:
> > > kvm: x86: mitigation for L1 cache terminal fault vulnerabilities
> > > 
> > > Intel's new microcode adds a new feature bit in CPUID[7,0].EDX[28].
> > > If it is active, the displacement flush is unnecessary.  Tested on
> > > a Coffee Lake machine.
> > 
> > ooops, that nice commit description you wrote Paolo got messed up. Sorry!
> 
> Ended up fixing this up to be:
> 
> >From a2029479d4c83c8e3476d8b37d288fb441576b43 Mon Sep 17 00:00:00 2001
> From: Paolo Bonzini <pbonzini@redhat.com>
> Date: Wed, 20 Jun 2018 15:49:44 -0400
> Subject: [PATCH v2.2 05/12] kvm: x86: mitigation for L1 cache terminal fault
>  vulnerabilities
> 
> We add two mitigation modes for CVE-2018-3620, aka L1 terminal
> fault.  The two modes are "vmentry_l1d_flush=1" and "vmentry_l1d_flush=2".
> 
> "vmentry_l1d_flush=2" is simply doing an L1 cache flush on every VMENTER.
> "vmentry_l1d_flush=1" is trying to avoid so many L1 cache flueshes on
> VMENTER and instead only does if the reason for entering the hypervisor
> is based on the type of code that is executed. The idea is based on Intel's
> patches, but we treat all vmexits as safe unless they execute
> specific code that is considered unsafe. There is no hardcoded list of
> "safe" exit reasons; but vmexits are considered safe unless:
>  - They trigger the emulator, which could be a good target for
>    other speculative execution-based threats,
>  - or the MMU, which can bring host page tables in the L1 cache.
>  - In addition, executing userspace or another process will trigger a flush.

What about softirqs? Won't that happen on every VMEXIT? And how about
interrupts? They can happen too, any way to figure out whether an interrupt
happend between VMEXIT -> VMENTER? 

Also would it make sense to flip this around - that is everything is considered
a blacklist (aka always flush), except those routines which we believe
are OK and hence can be whitelisted?

And then what about the situation where say in a month somebody fixes a bug
or adds a new feature that brings in the host page table (alloc_page?)
and now we have to move it from whitelist to blacklist - there is not much
of an automatic way to detect this I think?

Or maybe there is? What are some of the calls that KVM can make that would
bring in the host page tables? I am assuming any call that hits 'page_alloc'
or 'kzalloc', 'vzalloc' or such? But others?

Is there a good way to figure this out except running trace on an VMEXIT->VMENTER
using the different kvm-unit tests and seeing this?


> 
> The default is "vmentry_l1d_flush=1".  The cost of "vmentry_l1d_flush=2"
> is up to 2.5x more expensive vmexits on Haswell processors, and 30% on
> Coffee Lake (for the latter, this is independent of whether microcode
> or the generic flush code are used).
> 
> The mitigation does not in any way try to do anything about hyperthreading;
> it is possible for a sibling thread to read data from the cache during a
> vmexit, before the host completes the flush, or to read data from the cache
> while a sibling runs.  The suggestion there is to disable hyperthreading
> unless you've configured your system to dedicate each core to a specific
> guest.

I vaguely recall folks saying that they tried the "trigger IPI on VMEXIT" - was
there any performance metrix compared to say having SMT disabled/enabled?
And are there any patches for this floating around that have been ported against
upstream kernel?

(Asking as I am thinking to throw to the performance team an obfuscated kernel
with various knobs and see which sucks less).

> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> ---
> v2: Add checks for X86_BUG_L1TF
>    - Rework the commit description.
> v3: change module parameter to S_IRUGO
>     move the kvm_l1d_flush closer to VMENTER
> ---
>  Documentation/admin-guide/kernel-parameters.txt | 11 +++++
>  arch/x86/include/asm/kvm_host.h                 |  8 ++++
>  arch/x86/kvm/mmu.c                              |  1 +
>  arch/x86/kvm/svm.c                              |  1 +
>  arch/x86/kvm/vmx.c                              | 51 +++++++++++++++++++++-
>  arch/x86/kvm/x86.c                              | 58 ++++++++++++++++++++++++-
>  6 files changed, 128 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 4e56e6d0f124..dd1db796a463 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -1923,6 +1923,17 @@
>  			SMT (aka Hyper-Threading) enabled then don't load KVM module.
>  			Default is 0 (allow module to be loaded).
>  
> +	kvm.vmentry_l1d_flush=[KVM] Mitigation for L1 Terminal Fault CVE.
> +			Valid arguments: 0, 1, 2
> +
> +			2 does an L1 cache flush on every VMENTER.
> +			1 tries to avoid so many L1 cache flush on VMENTERs and instead
> +			do it only if the kind of code that is executed would lead to
> +			leaking host memory.
> +			0 disables the mitigation
> +
> +			Default is 1 (do L1 cache flush in specific instances)
> +
>  	kvm.mmu_audit=	[KVM] This is a R/W parameter which allows audit
>  			KVM MMU at runtime.
>  			Default is 0 (off)
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index c13cd28d9d1b..78748925a370 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -713,6 +713,12 @@ struct kvm_vcpu_arch {
>  
>  	/* be preempted when it's in kernel-mode(cpl=0) */
>  	bool preempted_in_kernel;
> +
> +	/* for L1 terminal fault vulnerability */
> +	bool vcpu_unconfined;
> +
> +	/* must flush the L1 Data cache */
> +	bool flush_cache_req;
>  };
>  
>  struct kvm_lpage_info {
> @@ -881,6 +887,7 @@ struct kvm_vcpu_stat {
>  	u64 signal_exits;
>  	u64 irq_window_exits;
>  	u64 nmi_window_exits;
> +	u64 l1d_flush;
>  	u64 halt_exits;
>  	u64 halt_successful_poll;
>  	u64 halt_attempted_poll;
> @@ -1449,6 +1456,7 @@ bool kvm_intr_is_single_vcpu(struct kvm *kvm, struct kvm_lapic_irq *irq,
>  
>  void kvm_set_msi_irq(struct kvm *kvm, struct kvm_kernel_irq_routing_entry *e,
>  		     struct kvm_lapic_irq *irq);
> +void kvm_l1d_flush(void);
>  
>  static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu)
>  {
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index d594690d8b95..4d4e3dc2494e 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -3840,6 +3840,7 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
>  {
>  	int r = 1;
>  
> +	vcpu->arch.vcpu_unconfined = true;
>  	switch (vcpu->arch.apf.host_apf_reason) {
>  	default:
>  		trace_kvm_page_fault(fault_address, error_code);
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index f059a73f0fd0..fffc447f5410 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -5437,6 +5437,7 @@ static void svm_flush_tlb(struct kvm_vcpu *vcpu, bool invalidate_gpa)
>  
>  static void svm_prepare_guest_switch(struct kvm_vcpu *vcpu)
>  {
> +	vcpu->arch.flush_cache_req = false;
>  }
>  
>  static inline void sync_cr8_to_lapic(struct kvm_vcpu *vcpu)
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 559a12b6184d..d8ef111b8f5f 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -71,6 +71,9 @@ static const struct x86_cpu_id vmx_cpu_id[] = {
>  };
>  MODULE_DEVICE_TABLE(x86cpu, vmx_cpu_id);
>  
> +static int __read_mostly vmentry_l1d_flush = 1;
> +module_param_named(vmentry_l1d_flush, vmentry_l1d_flush, int, S_IRUGO);
> +
>  static bool __read_mostly enable_vpid = 1;
>  module_param_named(vpid, enable_vpid, bool, 0444);
>  
> @@ -2618,6 +2621,45 @@ static void vmx_save_host_state(struct kvm_vcpu *vcpu)
>  				   vmx->guest_msrs[i].mask);
>  }
>  
> +static void vmx_prepare_guest_switch(struct kvm_vcpu *vcpu)
> +{
> +	vmx_save_host_state(vcpu);
> +
> +	if (!enable_ept || static_cpu_has(X86_FEATURE_HYPERVISOR) ||
> +	    !static_cpu_has(X86_BUG_L1TF)) {
> +		vcpu->arch.flush_cache_req = false;
> +		return;
> +	}
> +
> +	switch (vmentry_l1d_flush) {
> +	case 0:
> +		vcpu->arch.flush_cache_req = false;
> +		break;
> +	case 1:
> +		/*
> +		 * If vmentry_l1d_flush is 1, each vmexit handler is responsible for
> +		 * setting vcpu->arch.vcpu_unconfined.  Currently this happens in the
> +		 * following cases:
> +		 * - vmlaunch/vmresume: we do not want the cache to be cleared by a
> +		 *   nested hypervisor *and* by KVM on bare metal, so we just do it
> +		 *   on every nested entry.  Nested hypervisors do not bother clearing
> +		 *   the cache.
> +		 * - anything that runs the emulator (the slow paths for EPT misconfig
> +		 *   or I/O instruction)
> +		 * - anything that can cause get_user_pages (EPT violation, and again
> +		 *   the slow paths for EPT misconfig or I/O instruction)
> +		 * - anything that can run code outside KVM (external interrupt,
> +		 *   which can run interrupt handlers or irqs; or the sched_in
> +		 *   preempt notifier)
> +		 */
> +		break;
> +	case 2:
> +	default:
> +		vcpu->arch.flush_cache_req = true;
> +		break;
> +	}
> +}
> +
>  static void __vmx_load_host_state(struct vcpu_vmx *vmx)
>  {
>  	if (!vmx->host_state.loaded)
> @@ -9751,6 +9793,7 @@ static void vmx_handle_external_intr(struct kvm_vcpu *vcpu)
>  			[ss]"i"(__KERNEL_DS),
>  			[cs]"i"(__KERNEL_CS)
>  			);
> +		vcpu->arch.vcpu_unconfined = true;
>  	}
>  }
>  STACK_FRAME_NON_STANDARD(vmx_handle_external_intr);
> @@ -10008,6 +10051,9 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
>  	evmcs_rsp = static_branch_unlikely(&enable_evmcs) ?
>  		(unsigned long)&current_evmcs->host_rsp : 0;
>  
> +	if (vcpu->arch.flush_cache_req)
> +		kvm_l1d_flush();
> +
>  	asm(
>  		/* Store host registers */
>  		"push %%" _ASM_DX "; push %%" _ASM_BP ";"
> @@ -11811,6 +11857,9 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch)
>  		return ret;
>  	}
>  
> +	/* Hide L1D cache contents from the nested guest.  */
> +	vmx->vcpu.arch.vcpu_unconfined = true;
> +
>  	/*
>  	 * If we're entering a halted L2 vcpu and the L2 vcpu won't be woken
>  	 * by event injection, halt vcpu.
> @@ -12928,7 +12977,7 @@ static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
>  	.vcpu_free = vmx_free_vcpu,
>  	.vcpu_reset = vmx_vcpu_reset,
>  
> -	.prepare_guest_switch = vmx_save_host_state,
> +	.prepare_guest_switch = vmx_prepare_guest_switch,
>  	.vcpu_load = vmx_vcpu_load,
>  	.vcpu_put = vmx_vcpu_put,
>  
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 1065d4e7c5fd..b3365883ccaf 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -199,6 +199,7 @@ struct kvm_stats_debugfs_item debugfs_entries[] = {
>  	{ "irq_injections", VCPU_STAT(irq_injections) },
>  	{ "nmi_injections", VCPU_STAT(nmi_injections) },
>  	{ "req_event", VCPU_STAT(req_event) },
> +	{ "l1d_flush", VCPU_STAT(l1d_flush) },
>  	{ "mmu_shadow_zapped", VM_STAT(mmu_shadow_zapped) },
>  	{ "mmu_pte_write", VM_STAT(mmu_pte_write) },
>  	{ "mmu_pte_updated", VM_STAT(mmu_pte_updated) },
> @@ -6054,6 +6055,8 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu,
>  	bool writeback = true;
>  	bool write_fault_to_spt = vcpu->arch.write_fault_to_shadow_pgtable;
>  
> +	vcpu->arch.vcpu_unconfined = true;
> +
>  	/*
>  	 * Clear write_fault_to_shadow_pgtable here to ensure it is
>  	 * never reused.
> @@ -6537,10 +6540,49 @@ static struct notifier_block pvclock_gtod_notifier = {
>  };
>  #endif
>  
> +
> +/*
> + * The L1D cache is 32 KiB on Skylake, but to flush it we have to read in
> + * 64 KiB because the replacement algorithm is not exactly LRU.
> + */
> +#define L1D_CACHE_ORDER 4
> +static void *__read_mostly empty_zero_pages;
> +
> +void kvm_l1d_flush(void)
> +{
> +	/* FIXME: could this be boot_cpu_data.x86_cache_size * 2?  */
> +	int size = PAGE_SIZE << L1D_CACHE_ORDER;
> +
> +	ASSERT(boot_cpu_has(X86_BUG_L1TF));
> +
> +	asm volatile(
> +		/* First ensure the pages are in the TLB */
> +		"xorl %%eax, %%eax\n\t"
> +		"11: \n\t"
> +		"movzbl (%0, %%" _ASM_AX "), %%ecx\n\t"
> +		"addl $4096, %%eax\n\t"
> +		"cmpl %%eax, %1\n\t"
> +		"jne 11b\n\t"
> +		"xorl %%eax, %%eax\n\t"
> +		"cpuid\n\t"
> +		/* Now fill the cache */
> +		"xorl %%eax, %%eax\n\t"
> +		"12:\n\t"
> +		"movzbl (%0, %%" _ASM_AX "), %%ecx\n\t"
> +		"addl $64, %%eax\n\t"
> +		"cmpl %%eax, %1\n\t"
> +		"jne 12b\n\t"
> +		"lfence\n\t"
> +		: : "r" (empty_zero_pages), "r" (size)
> +		: "eax", "ebx", "ecx", "edx");
> +}
> +EXPORT_SYMBOL_GPL(kvm_l1d_flush);
> +
>  int kvm_arch_init(void *opaque)
>  {
>  	int r;
>  	struct kvm_x86_ops *ops = opaque;
> +	struct page *page;
>  
>  	if (kvm_x86_ops) {
>  		printk(KERN_ERR "kvm: already loaded the other module\n");
> @@ -6569,10 +6611,15 @@ int kvm_arch_init(void *opaque)
>  				"being able to snoop the host memory!");
>  	}
>  	r = -ENOMEM;
> +	page = alloc_pages(GFP_ATOMIC, L1D_CACHE_ORDER);
> +	if (!page)
> +		goto out;
> +	empty_zero_pages = page_address(page);
> +
>  	shared_msrs = alloc_percpu(struct kvm_shared_msrs);
>  	if (!shared_msrs) {
>  		printk(KERN_ERR "kvm: failed to allocate percpu kvm_shared_msrs\n");
> -		goto out;
> +		goto out_free_zero_pages;
>  	}
>  
>  	r = kvm_mmu_module_init();
> @@ -6603,6 +6650,8 @@ int kvm_arch_init(void *opaque)
>  
>  	return 0;
>  
> +out_free_zero_pages:
> +	free_pages((unsigned long)empty_zero_pages, L1D_CACHE_ORDER);
>  out_free_percpu:
>  	free_percpu(shared_msrs);
>  out:
> @@ -6627,6 +6676,7 @@ void kvm_arch_exit(void)
>  #endif
>  	kvm_x86_ops = NULL;
>  	kvm_mmu_module_exit();
> +	free_pages((unsigned long)empty_zero_pages, L1D_CACHE_ORDER);
>  	free_percpu(shared_msrs);
>  }
>  
> @@ -7405,7 +7455,11 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>  
>  	preempt_disable();
>  
> +	vcpu->arch.flush_cache_req = vcpu->arch.vcpu_unconfined;
>  	kvm_x86_ops->prepare_guest_switch(vcpu);
> +	vcpu->arch.vcpu_unconfined = false;
> +	if (vcpu->arch.flush_cache_req)
> +		vcpu->stat.l1d_flush++;
>  
>  	/*
>  	 * Disable IRQs before setting IN_GUEST_MODE.  Posted interrupt
> @@ -7592,6 +7646,7 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
>  	struct kvm *kvm = vcpu->kvm;
>  
>  	vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
> +	vcpu->arch.vcpu_unconfined = true;
>  
>  	for (;;) {
>  		if (kvm_vcpu_running(vcpu)) {
> @@ -8711,6 +8766,7 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu)
>  
>  void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu)
>  {
> +	vcpu->arch.vcpu_unconfined = true;
>  	kvm_x86_ops->sched_in(vcpu, cpu);
>  }
>  
> -- 
> 2.13.4
>