From: Tim Chen
To: speck for Thomas Gleixner <speck@linutronix.de>
Date: Thu, 24 May 2018 16:18:08 -0700
Subject: Re: L1D-Fault KVM mitigation

On 05/24/2018 08:33 AM, speck for Thomas Gleixner wrote:
> On Thu, 24 May 2018, speck for Thomas Gleixner wrote:
>> On Thu, 24 May 2018, speck for Peter Zijlstra wrote:
>>> On Wed, May 23, 2018 at 10:45:45AM +0100, speck for David Woodhouse wrote:
>>>> The microcode trick just makes it a lot easier because we don't
>>>> have to *explicitly* pause the sibling vCPUs and manage their state on
>>>> every vmexit/entry. And avoids potential race conditions with managing
>>>> that in software.
>>>
>>> Yes, it would certainly help and avoid a fair bit of ugly. It would, for
>>> instance, avoid having to modify irq_enter() / irq_exit(), which would
>>> otherwise be required (and possibly leak all data touched up until that
>>> point is reached).
>>>
>>> But even with all that, adding L1-flush to every VMENTER will hurt lots.
>>> Consider for example the PIO emulation used when booting a guest from a
>>> disk image. That causes VMEXIT/VMENTER at stupendous rates.
>>
>> Just did a test on SKL Client where I have ucode. It does not have HT so
>> it's not suffering from any HT side effects when L1D is flushed.
>>
>> Boot time from a disk image is ~1s measured from the first vcpu enter.
>>
>> With L1D Flush on vmenter the boot time is about 5-10% slower. And that has
>> lots of PIO operations in the early boot.
>>
>> For a kernel build the L1D Flush has an overhead of < 1%.
>>
>> Netperf guest to host has a slight drop of the throughput in the 2%
>> range. Host to guest surprisingly goes up by ~3%. Fun stuff!
>>
>> Now I isolated two host CPUs and pinned the two vCPUs on them to be able to
>> measure the overhead. Running cyclictest with a period of 25us in the guest
>> on an isolated guest CPU and monitoring the behaviour with perf on the host
>> for the corresponding host CPU gives
>>
>>          No Flush                        Flush
>>
>>     1.31 insn per cycle             1.14 insn per cycle
>>
>>     2e6 L1-dcache-load-misses/sec   26e6 L1-dcache-load-misses/sec
>>
>> In that simple test the L1D misses go up by a factor of 13.
>>
>> Now with the whole gang scheduling the numbers I heard through the
>> grapevine are in the range of factor 130, i.e. 13k% for a simple boot from
>> disk image. 13 minutes instead of 6 seconds...
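For reference, the flush being measured above is a single MSR write on the
VM-entry path. A minimal sketch, assuming the updated microcode exposes
MSR_IA32_FLUSH_CMD (0x10b) with the L1D_FLUSH command bit; the software
fallback for CPUs without the MSR (reading a 64K buffer to displace L1D)
is not shown:

/*
 * Sketch of the per-VMENTER L1D flush discussed above, assuming the
 * updated microcode exposes MSR_IA32_FLUSH_CMD (0x10b) with the
 * L1D_FLUSH command bit.
 */
#include <asm/msr.h>

#ifndef MSR_IA32_FLUSH_CMD
#define MSR_IA32_FLUSH_CMD	0x0000010b
#define L1D_FLUSH		(1ULL << 0)
#endif

static inline void l1d_flush_before_vmenter(void)
{
	/* One WRMSR per VM entry; this is the cost in the numbers above. */
	wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH);
}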
The performance is highly dependent on how often we VM exit. Working with
Peter Z on his prototype, the performance ranges from no regression for a
network loopback, through ~20% regression for a kernel compile, to ~100%
regression on file I/O. PIO brings out the worst aspects of the
synchronization overhead, as we VM exit on every dword PIO read in; the
kernel and initrd images were about 50 MB for the experiment, which led to
13 min of load time.

We may need to do the co-scheduling only when the VM exit rate is low, and
turn off SMT when the VM exit rate becomes too high. A rough sketch of such
a heuristic follows at the end of this mail.

(Note: I haven't added the L1 flush on VM entry to my experiment; that is
on the todo list.)

Tim

>>
>> That's not surprising at all, though the magnitude is way higher than I
>> expected. I don't see a realistic chance for vmexit heavy workloads to work
>> with that synchronization thing at all, whether it's ucode assisted or not.
>
> That said, I think we should stage the host side mitigations plus the L1
> flush on vmenter ASAP so we are not standing there with our pants down when
> the cat comes out of the bag early. That means HT off, but it's still
> better than having absolutely nothing.
>
> The gang scheduling nonsense can be added on top if it should
> surprisingly turn out to be usable at all.
>
> Thanks,
>
> tglx
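A rough sketch of the exit-rate heuristic mentioned above. The window size,
thresholds and all names here are made up for illustration, and acting on
the decision (parking the sibling, re-enabling SMT) is left out:

/*
 * Keep SMT and co-scheduling while the guest's VM-exit rate stays
 * low; fall back to SMT-off (or per-entry L1D flush) when it spikes.
 * Hysteresis between the two thresholds avoids flapping on the
 * boundary.
 */
#include <linux/types.h>
#include <linux/time64.h>

#define EXIT_WINDOW_NS	(100 * NSEC_PER_MSEC)
#define EXITS_HIGH	10000	/* per window: PIO-style exit storm */
#define EXITS_LOW	1000	/* per window: mostly compute-bound */

struct exit_rate {
	u64	window_start;	/* ns timestamp of window begin */
	u64	count;		/* exits seen in current window */
	bool	smt_off;	/* current policy decision */
};

static void account_vmexit(struct exit_rate *r, u64 now)
{
	r->count++;

	if (now - r->window_start < EXIT_WINDOW_NS)
		return;

	/* Window full: pick the policy for the next window. */
	if (!r->smt_off && r->count > EXITS_HIGH)
		r->smt_off = true;
	else if (r->smt_off && r->count < EXITS_LOW)
		r->smt_off = false;

	r->count = 0;
	r->window_start = now;
}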