Subject: [MODERATED] Re: L1D-Fault KVM mitigation
From: Paolo Bonzini
To: speck@linutronix.de
Date: Tue, 24 Apr 2018 14:53:15 +0200
Message-ID: <8cbc35b2-f75a-6357-014d-e20ff7284ac0@redhat.com>
In-Reply-To: <20180424093537.GC4064@hirez.programming.kicks-ass.net>
References: <20180424090630.wlghmrpasn7v7wbn@suse.de> <20180424093537.GC4064@hirez.programming.kicks-ass.net>

On 24/04/2018 11:35, speck for Peter Zijlstra wrote:
> I know that I worked a little with Tim on this, and I know Google did
> their own thing (but have not seen patches from them -- is pjt on this
> list?). I've also heard Amazon was also working on things (are they
> here?). And I think RHT was also looking into something (mingo, bonzini
> -- are you guys reading?)

Yes, I am.

First of all: the cost of doing an L1D flush on every vmentry is
absolutely horrible on KVM microbenchmarks, but seems a little better
(around 6% worst case) on syscall microbenchmarks.  "Message-passing"
workloads with vCPUs repeatedly going to sleep are the worst.
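For concreteness, here is a rough sketch of what that per-vmentry flush
boils down to.  The flush-MSR number/bit and the 64 KiB fallback buffer
size below are illustrative assumptions on my part, not a settled
interface:

/*
 * Illustrative sketch only (kernel-style): flush the L1D right before
 * vmentry.  The MSR path assumes a microcode-provided flush command
 * MSR; the fallback reads a buffer large enough to displace a 32 KiB
 * L1D.
 */
#define MSR_FLUSH_CMD        0x10b           /* assumption */
#define FLUSH_CMD_L1D        (1ULL << 0)     /* assumption */

#define L1D_FALLBACK_SIZE    (64 * 1024)
static u8 l1d_fallback_buf[L1D_FALLBACK_SIZE] __aligned(PAGE_SIZE);

static void l1d_flush_before_vmentry(bool have_flush_msr)
{
	int i;

	if (have_flush_msr) {
		wrmsrl(MSR_FLUSH_CMD, FLUSH_CMD_L1D);
		return;
	}

	/* Software fallback: touch every cache line of the buffer. */
	for (i = 0; i < L1D_FALLBACK_SIZE; i += 64)
		asm volatile("" : : "r" (l1d_fallback_buf[i]) : "memory");
}

Either variant adds a fixed cost to every single vmentry, which
presumably is why the message-passing workloads above (constant
sleep/wakeup, hence constant vmexit/vmentry) hurt the most.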
Second, hyperthreading in general doesn't exactly shine when running
many small virtual machines, since the VMs are unlikely to share any
code or data and each sibling effectively gets half the normal amount
of L1 cache.  Perhaps KSM shares the guest kernels and recovers some of
the icache (assuming there are kernel-heavy benchmarks that _also_
benefit from hyperthreading), but it is more likely to hurt performance
than to improve it.

Hyperthreading may provide slightly lower jitter when you run two
different guests on the siblings.  But with gang scheduling you
wouldn't do that, so that's not an issue.

As a result, in the overcommitted case the main issue is having to
explain to the customers that disabling hyperthreading is not that bad.
Even in the non-overcommitted case, there is a possibility that host
IRQs or NMIs happen, which as Thomas pointed out can also pollute the
cache.

The only case where hyperthreading may be salvaged is the one where:

1) every guest vCPU is pinned to its own physical CPU, and memory is
   also reserved up front because you use 1GB hugetlbfs;

2) host IRQs either use VT-d posted interrupts or are pinned away from
   the physical CPUs that run guests;

3) you are using nohz_full and other similar fine-tuned configuration
   to ensure that the guest CPUs run smoothly.

This includes NFV usecases, but big databases like SAP are also run
like this sometimes.

In this case, you'd need to add synchronization around vmexit.
However, these workloads _will_ actually do vmexits, sometimes a lot
of them (e.g. unless you use nohz_full in the guest as well, you'll
have vmexits to program the LAPIC timer).  Either all of those exits
have to pay the synchronization cost, or you have to decide arbitrarily
that some vmexits are "confined" and unlikely to pollute the cache; for
those you skip the synchronization and the L1D flush.  For example you
could say "anything that does not do get_user_pages is confined".

Because you've made this arbitrary choice, the synchronization is total
security theater unless you know what you're doing: no two guests on
the same core, no interrupt handlers that can run during a vmexit and
pollute the L1 cache (if that happens, the other sibling would be able
to read that data), and so on.

BUT: 1) I'm not saying hyperthreading is valuable in those cases, only
that it can be salvaged; 2) if you're paranoid you're more likely to
disable HT anyway.

So while I do plan to test what happens when we do synchronization,
it's far from certain that we're going to ship it.  And even that would
only happen if it is acceptable upstream -- I'm not going to make it a
special Red Hat-only patch.

Ingo suggested, for ease of testing and also for ease of deployment, a
knob to easily online/offline all siblings but the first on each core.
There's still the chance that some userspace daemon starts before
hyperthreading is software-disabled that way and gets confused by the
number of CPUs suddenly halving, so the knob would have to exist both
on the kernel command line and in debugfs.

Paolo