From mboxrd@z Thu Jan  1 00:00:00 1970
Return-path: =?utf-8?q?=3CBATV+662c0d45ef9886d85002+5386+infradead=2Eorg+d?=
 =?utf-8?q?wmw2=40twosheds=2Esrs=2Einfradead=2Eorg=3E?=
Received: from twosheds.infradead.org ([2001:8b0:10b:1:21d:7dff:fe04:dbe2])	by
 Galois.linutronix.de with esmtps (TLS1.2:RSA_AES_256_CBC_SHA256:256)	(Exim
 4.80)	(envelope-from =?utf-8?q?=3CBATV+662c0d45ef9886d85002+5386+infradea?=
 =?utf-8?q?d=2Eorg+dwmw2=40twosheds=2Esrs=2Einfradead=2Eorg=3E=29?=	id
 1fLQL6-0000Nz-Sn	for speck@linutronix.de; Wed, 23 May 2018 11:45:50 +0200
Received: from [2001:8b0:10b:1::b8f]
	by twosheds.infradead.org with esmtpsa (Exim 4.90_1 #2 (Red Hat Linux))
	id 1fLQL4-0002rE-9R
	for speck@linutronix.de; Wed, 23 May 2018 09:45:46 +0000
Message-ID: <1527068745.8186.89.camel@infradead.org>
Subject: [MODERATED] Re: L1D-Fault KVM mitigation
From: David Woodhouse <dwmw2@infradead.org>
In-Reply-To: <20180424110445.GU4043@hirez.programming.kicks-ass.net>
References: <20180424090630.wlghmrpasn7v7wbn@suse.de>
	 <20180424093537.GC4064@hirez.programming.kicks-ass.net>
	 <1524563292.8691.38.camel@infradead.org>
	 <20180424110445.GU4043@hirez.programming.kicks-ass.net>
Date: Wed, 23 May 2018 10:45:45 +0100
Mime-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 7bit
To: speck@linutronix.de
List-ID: <speck.linutronix.de>


On Tue, 2018-04-24 at 13:04 +0200, speck for Peter Zijlstra wrote:
> On Tue, Apr 24, 2018 at 10:48:12AM +0100, speck for David Woodhouse wrote:
> > 
> > On Tue, 2018-04-24 at 11:35 +0200, speck for Peter Zijlstra wrote:
> > > 
> > > 
> > > Another option, that is being explored, is to co-schedule siblings.
> > > So ensure all siblings either run vcpus of the _same_ VM or idle.
> > > 
> > > Of course, this is all rather intrusive and ugly and brings with it
> > > setup costs as well, because you'd have to sync up on VMENTER, VMEXIT
> > > and interrupts (on the idle CPUs).
>
> > I hate to suggest more microcode hacks but... if there was an MSR bit
> > which, when set, would pause any HT sibling that was currently in VMX
> > non-root mode, then we could set that up to be automatically set on
> > vmexit and it would automatically pause the problematic siblings.
> > Meaning that co-ordinating vmexits with them might actually be
> > feasible?

> Not sure I'm following. The above assumes a sibling is running a VCPU of
> another VM, right? But it could equally well run any regular old task
> (including idle).
> 
> So only pausing siblings in VMX mode wouldn't help anything. The !VMX
> tasks could still be loading stuff into L1.

That's OK because it's only the VMX tasks which can abuse it, isn't it?

Let's assume we've fixed the problem for normal tasks, by flipping the
top bit in absent PTEs that actually contain swap pointers, etc.

The only thing we have left is VM guests. The microcode bit would say
that *if* a CPU thread is in non-root mode then *it* gets paused unless
its sibling is also in non-root mode for the same VMID.

So when both siblings are actually in the VM, they get to run. If one
sibling comes *out* of the VM to the host kernel or to run (host)
userspace, then the other one doesn't execute any guest instructions.
It can still take exceptions which cause a vmexit, though.

We'd also want a vCPU to be able to run if its sibling is actually in
the host but *idle* (and has flushed the L1). Perhaps we'd even
automatically flush the L1 when resuming a sibling that got paused.
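
To make the rule concrete, here is a rough sketch of the decision the
microcode bit would have to make for each thread. It is purely
illustrative: no such MSR bit exists, and the struct, its fields and
the function names are all invented for this example.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-thread state; these names exist nowhere in hardware or Linux. */
struct ht_thread {
	bool non_root;		/* executing guest code (VMX non-root mode) */
	bool idle_flushed;	/* in the host, idle, and L1 already flushed */
	unsigned int vmid;	/* VM identifier; only meaningful if non_root */
};

/* May 'self' execute instructions, given what its sibling is doing? */
bool guest_may_run(const struct ht_thread *self,
		   const struct ht_thread *sibling)
{
	if (!self->non_root)
		return true;	/* host code is never paused by this rule */
	if (sibling->non_root)
		return sibling->vmid == self->vmid;	/* same VM on both threads */
	return sibling->idle_flushed;	/* sibling in host: OK only if idle and flushed */
}

/* Walk through the cases described above. */
void guest_may_run_examples(void)
{
	struct ht_thread vm1_a     = { .non_root = true, .vmid = 1 };
	struct ht_thread vm1_b     = { .non_root = true, .vmid = 1 };
	struct ht_thread vm2       = { .non_root = true, .vmid = 2 };
	struct ht_thread host_busy = { .non_root = false };
	struct ht_thread host_idle = { .non_root = false, .idle_flushed = true };

	assert(guest_may_run(&vm1_a, &vm1_b));		/* both in the same VM */
	assert(!guest_may_run(&vm1_a, &vm2));		/* different VMs: paused */
	assert(!guest_may_run(&vm1_a, &host_busy));	/* sibling left the VM: paused */
	assert(guest_may_run(&vm1_a, &host_idle));	/* sibling idle with L1 flushed */
	assert(guest_may_run(&host_busy, &vm1_a));	/* the host itself always runs */
}
```

The third case is the interesting one: the moment one sibling takes a
vmexit, the other stops executing guest instructions without the host
having to send it anything.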

It does still depend on gang scheduling (or at least forced sibling
idle, which is a subset of that), or a singleton vCPU might *never* get
run. But we were going to have to do something along those lines
anyway. The microcode trick just makes it a lot easier because we don't
have to *explicitly* pause the sibling vCPUs and manage their state on
every vmexit/entry. It also avoids the potential race conditions of
managing that in software.