From: Tim Chen
To: speck@linutronix.de
Date: Fri, 25 May 2018 11:22:37 -0700
Subject: Re: L1D-Fault KVM mitigation

On 05/24/2018 04:18 PM, speck for Tim Chen wrote:
> On 05/24/2018 08:33 AM, speck for Thomas Gleixner wrote:
>> On Thu, 24 May 2018, speck for Thomas Gleixner wrote:
>>> On Thu, 24 May 2018, speck for Peter Zijlstra wrote:
>>>> On Wed, May 23, 2018 at 10:45:45AM +0100, speck for David Woodhouse wrote:
>>>>> The microcode trick just makes it a lot easier because we don't
>>>>> have to *explicitly* pause the sibling vCPUs and manage their state on
>>>>> every vmexit/entry. And avoids potential race conditions with managing
>>>>> that in software.
>>>>
>>>> Yes, it would certainly help and avoid a fair bit of ugly.
>>>> It would, for
>>>> instance, avoid having to modify irq_enter() / irq_exit(), which would
>>>> otherwise be required (and possibly leak all data touched up until that
>>>> point is reached).
>>>>
>>>> But even with all that, adding L1-flush to every VMENTER will hurt lots.
>>>> Consider for example the PIO emulation used when booting a guest from a
>>>> disk image. That causes VMEXIT/VMENTER at stupendous rates.
>>>
>>> Just did a test on SKL Client where I have ucode. It does not have HT, so
>>> it's not suffering from any HT side effects when L1D is flushed.
>>>
>>> Boot time from a disk image is ~1s measured from the first vcpu enter.
>>>
>>> With L1D Flush on vmenter the boot time is about 5-10% slower. And that has
>>> lots of PIO operations in the early boot.
>>>
>>> For a kernel build the L1D Flush has an overhead of < 1%.
>>>
>>> Netperf guest to host has a slight drop of the throughput in the 2%
>>> range. Host to guest surprisingly goes up by ~3%. Fun stuff!
>>>
>>> Now I isolated two host CPUs and pinned the two vCPUs on them to be able to
>>> measure the overhead. Running cyclictest with a period of 25us in the guest
>>> on an isolated guest CPU and monitoring the behaviour with perf on the host
>>> for the corresponding host CPU gives:
>>>
>>>          No Flush                         Flush
>>>
>>>     1.31 insn per cycle              1.14 insn per cycle
>>>
>>>     2e6 L1-dcache-load-misses/sec    26e6 L1-dcache-load-misses/sec
>>>
>>> In that simple test the L1D misses go up by a factor of 13.
>>>
>>> Now with the whole gang scheduling the numbers I heard through the
>>> grapevine are in the range of factor 130, i.e. 13k% for a simple boot from
>>> disk image. 13 minutes instead of 6 seconds...
>
> The performance is highly dependent on how often we VM exit.
> Working with Peter Z on his prototype, the performance ranges from
> no regression for a network loopback, through ~20% regression for a kernel
> compile, to ~100% regression on file IO.
> PIO brings out the worst aspect of the synchronization overhead, as we
> VM exit on every dword PIO read. The kernel and initrd image was about
> 50 MB for the experiment, which led to 13 min of load time.
>
> We may need to do the co-scheduling only when the VM exit rate is low, and
> turn off SMT when the VM exit rate becomes too high.
>
> (Note: I haven't added in the L1 flush on VM entry for my experiment; that
> is on the todo list.)

As a post note, I added in the L1 flush and the performance numbers
pretty much stay the same. So the synchronization overhead is dominant
and the L1 flush overhead is secondary.

Tim