From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path: 
Received: from merlin.infradead.org ([2001:8b0:10b:1231::1]) by
 Galois.linutronix.de with esmtps (TLS1.2:RSA_AES_256_CBC_SHA256:256)
 (Exim 4.80) (envelope-from ) id 1fLmoM-00023d-Mm for
 speck@linutronix.de; Thu, 24 May 2018 11:45:31 +0200
Received: from j217100.upc-j.chello.nl ([24.132.217.100]
 helo=hirez.programming.kicks-ass.net) by merlin.infradead.org with
 esmtpsa (Exim 4.90_1 #2 (Red Hat Linux)) id 1fLmoL-00060N-65 for
 speck@linutronix.de; Thu, 24 May 2018 09:45:29 +0000
Date: Thu, 24 May 2018 11:45:26 +0200
From: Peter Zijlstra 
Subject: [MODERATED] Re: L1D-Fault KVM mitigation
Message-ID: <20180524094526.GE12198@hirez.programming.kicks-ass.net>
References: <20180424090630.wlghmrpasn7v7wbn@suse.de>
 <20180424093537.GC4064@hirez.programming.kicks-ass.net>
 <1524563292.8691.38.camel@infradead.org>
 <20180424110445.GU4043@hirez.programming.kicks-ass.net>
 <1527068745.8186.89.camel@infradead.org>
MIME-Version: 1.0
In-Reply-To: <1527068745.8186.89.camel@infradead.org>
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 7bit
To: speck@linutronix.de
List-ID: 

On Wed, May 23, 2018 at 10:45:45AM +0100, speck for David Woodhouse wrote:
> That's OK because it's only the VMX tasks which can abuse it, isn't it?

If, like you outline below, this is an (optional) ucode assist for
co-scheduling matching VCPU threads, then yes.

> Let's assume we've fixed the problem for normal tasks, by flipping the
> top bit in absent PTEs that actually contain swap pointers, etc.
>
> The only thing we have left is VM guests. The microcode bit would say
> that *if* a CPU thread is in non-root mode then *it* gets paused unless
> its sibling is also in non-root mode for the same VMID.
>
> So when both siblings are actually in the VM, they get to run. If one
> sibling comes *out* of the VM to the host kernel or to run (host)
> userspace, then the other one doesn't execute any guest instructions.
> It can take exceptions which cause a vmexit though.

Would it make sense to time-limit being 'stuck', much like PLE?

> We'd also want a vCPU to be able to run if its sibling is actually in
> the host but *idle* (and has flushed the L1. Perhaps we actually
> automatically flush the L1 when resuming a sibling that got paused).

Right, idle is a wildcard which matches any VCPU. We don't care about
the cache state of the sibling though; L1 is shared, and since VMENTER
must flush L1, that is sufficient.

> It does still depend on gang scheduling (or at least forced sibling
> idle which is a subset of that), or a singleton vCPU might *never* get
> run. But we were going to have to do something along those lines
> anyway.

Linus has opinions on that.. but yes, without that, all that remains is
disabling HT afaict.

> The microcode trick just makes it a lot easier because we don't
> have to *explicitly* pause the sibling vCPUs and manage their state on
> every vmexit/entry. And avoids potential race conditions with managing
> that in software.

Yes, it would certainly help and avoid a fair bit of ugly. It would, for
instance, avoid having to modify irq_enter() / irq_exit(), which would
otherwise be required (and which could otherwise leak all data touched
before that point is reached).

But even with all that, adding an L1 flush to every VMENTER will hurt a
lot. Consider for example the PIO emulation used when booting a guest
from a disk image; that causes VMEXIT/VMENTER at stupendous rates.

Also, none of this readily addresses the problem of load balancing
shredding the VCPU localities required for this.
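For illustration, the pairing rule discussed above (a hyperthread in VMX
non-root mode may only execute if its sibling is idle or in non-root mode
for the same VMID, with idle acting as a wildcard) could be sketched as
below. This is a minimal model of the proposed ucode behaviour, not real
code; the names (sibling_mode, thread_state, may_run_guest) are all made
up for the example:

```c
#include <stdbool.h>

/*
 * Sketch of the proposed ucode-assisted pairing rule: all names are
 * illustrative, not an actual kernel or hardware interface.
 */
enum sibling_mode { MODE_HOST, MODE_GUEST, MODE_IDLE };

struct thread_state {
	enum sibling_mode mode;
	unsigned int vmid;	/* only meaningful in MODE_GUEST */
};

/* May @self execute guest instructions, given @sibling's state? */
static bool may_run_guest(const struct thread_state *self,
			  const struct thread_state *sibling)
{
	/* the rule only constrains threads in non-root (guest) mode */
	if (self->mode != MODE_GUEST)
		return false;

	/* idle is a wildcard: it matches any VCPU */
	if (sibling->mode == MODE_IDLE)
		return true;

	/* otherwise the sibling must be in the same guest */
	return sibling->mode == MODE_GUEST && sibling->vmid == self->vmid;
}
```

Note that with this rule a singleton VCPU whose sibling runs host tasks
never gets to execute, which is exactly why the gang scheduling (or
forced sibling idle) discussed above is still needed.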