From: Juergen Gross <jgross@suse.com>
To: Jan Beulich <JBeulich@suse.com>
Cc: andrew.cooper3@citrix.com, xen-devel@lists.xenproject.org,
Dario Faggioli <dfaggioli@suse.com>
Subject: Re: [PATCH v3 00/17] Alternative Meltdown mitigation
Date: Tue, 13 Feb 2018 15:29:39 +0100
Message-ID: <14b3325f-b13f-cfc7-444d-8368f56d5884@suse.com>
In-Reply-To: <5A83014E02000078001A7619@suse.com>
On 13/02/18 15:16, Jan Beulich wrote:
>>>> On 13.02.18 at 12:36, <jgross@suse.com> wrote:
>> On 12/02/18 18:54, Dario Faggioli wrote:
>>> On Fri, 2018-02-09 at 15:01 +0100, Juergen Gross wrote:
>>>> This series is available via github:
>>>>
>>>> https://github.com/jgross1/xen.git xpti
>>>>
>>>> Dario wants to do some performance tests for this series to compare
>>>> performance with Jan's series with all optimizations posted.
>>>>
>>> And some of this is indeed ready.
>>>
>>> So, this is again on my testbox, with 16 pCPUs and 12GB of RAM, and I
>>> used a guest with 16 vCPUs and 10GB of RAM.
>>>
>>> I benchmarked Jan's patch *plus* all the optimizations and overhead
>>> mitigation patches he posted on xen-devel (the ones that are already in
>>> staging, and also the ones that are not yet there). That's "XPTI-Light"
>>> in the table and in the graphs. Booting this with 'xpti=false' is
>>> considered the baseline, while booting with 'xpti=true' is the actual
>>> thing we want to measure. :-)
>>>
>>> Then I ran the same benchmarks on Juergen's branch above, enabled at
>>> boot. That's "XPYI" in the table and graphs (yes, I know, sorry for the
>>> typo!).
>>>
>>> http://openbenchmarking.org/result/1802125-DARI-180211144
>>>
>> http://openbenchmarking.org/result/1802125-DARI-180211144&obr_hgv=XPTI-Light+xpti%3Dfalse&obr_nor=y&obr_hgv=XPTI-Light+xpti%3Dfalse
>>
>> ...
>>
>>> Or, actually, that's not it! :-O In fact, right while I was writing
>>> this report, it came out on IRC that something can be done, on
>>> Juergen's XPTI series, to mitigate the performance impact a bit.
>>>
>>> Juergen sent me a patch already, and I'm re-running the benchmarks with
>>> that applied. I'll let you know how the results end up looking.
>>
>> It turned out the results are basically no different. So the general
>> problem with context switches is still there (which I expected, BTW).
>>
>> So I guess the really bad results with benchmarks triggering a lot of
>> vcpu scheduling show that my approach isn't going to fly, as the most
>> probable cause of the slow context switches is the newly introduced
>> serializing instructions (LTR, WRMSRs), which can't be avoided when we
>> want to use per-vcpu stacks.
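
(Just to illustrate the point: the per-vcpu stack approach has to do
something like the following on every context switch. This is only a
sketch with made-up names, not the actual patch code, but both ltr and
wrmsr are serializing instructions, which is where the cost comes from.)

#include <stdint.h>

#define MSR_LSTAR 0xC0000082U   /* architectural SYSCALL target MSR */

static inline void wrmsr64(uint32_t msr, uint64_t val)
{
    /* wrmsr is a serializing instruction. */
    asm volatile ( "wrmsr"
                   :: "c" (msr), "a" ((uint32_t)val),
                      "d" ((uint32_t)(val >> 32)) );
}

static void switch_to_vcpu_stack(uint16_t tss_sel, uint64_t syscall_stub)
{
    /* Reload the TSS so rsp0/IST point into the new vcpu's stack area
     * - ltr serializes as well. */
    asm volatile ( "ltr %w0" :: "r" (tss_sel) );
    /* Retarget SYSCALL at the per-vcpu entry stub. */
    wrmsr64(MSR_LSTAR, syscall_stub);
}
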
>>
>> OTOH the results of the other benchmarks showing some advantage over
>> Jan's solution indicate there is indeed an aspect which can be improved.
>>
>> Instead of preferring one approach over the other I have thought about
>> a way to use the best parts of each solution in a combined variant. In
>> case nobody feels strongly about pursuing my current approach further,
>> I'd like to suggest the following scheme:
>>
>> - Whenever an L4 page table of the guest is in use on only one physical
>> cpu, use the L4 shadow cache of my series in order to avoid having to
>> copy the L4 contents each time the hypervisor is left.
>>
>> - As soon as an L4 page table is activated on a second cpu, fall back
>> to using the per-cpu page table on that cpu (the cpu already using
>> the L4 page table can continue doing so).
>
> Would the first of these CPUs continue to run on the shadow L4 in
> that case? If so, would there be no synchronization issues? If not,
> how do you envision "telling" it to move to the per-CPU L4 (which,
> afaict, includes knowing which vCPU / pCPU that is)?
I thought to let the CPU keep running on the shadow L4. This L4 is
already configured for the CPU it is being used on, so we just have to
avoid activating it on a second CPU.

I don't see synchronization issues, as all guest L4 modifications would
be mirrored in the shadow, as is already done in my series.
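
Roughly what I have in mind, as a sketch with invented names (locking
omitted; the real thing would of course need to serialize updates of
the ownership field):

#include <stdbool.h>

#define CPU_NONE (~0u)   /* invented sentinel: shadow not active anywhere */

/* Per guest L4 we remember which pCPU (if any) currently runs on its
 * shadow.  The first user gets the shadow, any further pCPU falls back
 * to its per-cpu page table. */
struct shadow_l4 {
    unsigned int active_cpu;
    /* ... shadow page reference, sync state, ... */
};

static bool try_use_shadow_l4(struct shadow_l4 *s, unsigned int cpu)
{
    if ( s->active_cpu == cpu || s->active_cpu == CPU_NONE )
    {
        s->active_cpu = cpu;
        return true;        /* keep/start running on the shadow L4 */
    }
    return false;           /* second cpu: use the per-cpu page table */
}
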
>> - Before activation of an L4 shadow page table it is modified to map the
>> per-cpu data needed in guest mode for the local cpu only.
>
> I had been considering doing this in XPTI light for other purposes
> too (for example, it might be possible to short-circuit the guest
> system call path to get away without multiple page table switches).
> We really first need to settle on how much we feel is safe to expose
> while the guest is running. So far I've been under the impression
> that people actually think we should further reduce exposed pieces
> of code/data, rather than widen the "window".
I would like to have, for each cpu, some prepared L3 page tables meant
to be hooked into the correct shadow L4 slots. The shadow L4 should map
as few hypervisor parts as possible (again, as in my current series).
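
Something along these lines, again only a sketch with invented names
and slot numbers rather than the real l4e/l3e accessors:

#include <stdint.h>

#define PERCPU_L4_SLOT   260          /* invented slot number */
#define _PAGE_PRESENT    0x001ULL
#define _PAGE_RW         0x002ULL
#define MAX_CPUS         256

/* Machine addresses of the prepared per-cpu L3 page tables. */
static uint64_t percpu_l3_maddr[MAX_CPUS];

/* Before a shadow L4 is activated on a cpu, plug that cpu's L3 into a
 * fixed hypervisor slot, so guest mode only sees the per-cpu data it
 * really needs. */
static void hook_percpu_l3(uint64_t *shadow_l4, unsigned int cpu)
{
    shadow_l4[PERCPU_L4_SLOT] = percpu_l3_maddr[cpu] |
                                _PAGE_PRESENT | _PAGE_RW;
}
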
>> - Use INVPCID instead of %cr4 PGE toggling to speed up purging global
>> TLB entries (depending on the availability of the feature, of course).
>
> That's something we should do independent of what XPTI model
> we'd like to retain long term.
Right. That was just for completeness.
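
For reference, the two variants look roughly like this (sketch only;
real code would check for INVPCID support first):

#include <stdint.h>

#define X86_CR4_PGE (1UL << 7)

/* The traditional way: toggle CR4.PGE, i.e. two serializing CR4 writes,
 * to drop global TLB entries. */
static inline void flush_tlb_global_cr4(void)
{
    unsigned long cr4;

    asm volatile ( "mov %%cr4, %0" : "=r" (cr4) );
    asm volatile ( "mov %0, %%cr4" :: "r" (cr4 & ~X86_CR4_PGE) : "memory" );
    asm volatile ( "mov %0, %%cr4" :: "r" (cr4) : "memory" );
}

/* INVPCID type 2: invalidate all contexts, including global entries. */
static inline void flush_tlb_global_invpcid(void)
{
    struct { uint64_t pcid, addr; } desc = { 0, 0 };
    unsigned long type = 2;

    asm volatile ( "invpcid %0, %1" :: "m" (desc), "r" (type) : "memory" );
}
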
>> - Use the PCID feature to avoid purging TLB entries which might be
>> needed later (again depending on the hardware).
>
> Which first of all raises the question: Does PCID (other than the U
> bit) prevent use of TLB entries in the wrong context? IOW is the
> PCID check done early (during TLB lookup) rather than late (during
> insn retirement)?
We can test this easily. As the Linux kernel is already using this
mechanism for Meltdown mitigation I assume it is safe, or the Linux
kernel's way of avoiding Meltdown attacks wouldn't work.
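
For completeness, the mechanism I mean is the PCID field in CR3 plus
its "no-flush" bit, roughly (sketch, names invented; bit layout as per
the SDM):

#include <stdint.h>

#define X86_CR3_NOFLUSH   (1ULL << 63)
#define X86_CR3_PCID_MASK 0xfffULL

/* With CR4.PCIDE set, CR3 bits 11:0 select the PCID, and setting bit 63
 * tells the CPU to keep the non-global TLB entries tagged with that
 * PCID across the switch, so they can be reused later. */
static inline void write_cr3_pcid(uint64_t root_maddr, uint64_t pcid,
                                  int keep_tlb)
{
    uint64_t val = root_maddr | (pcid & X86_CR3_PCID_MASK);

    if ( keep_tlb )
        val |= X86_CR3_NOFLUSH;
    asm volatile ( "mov %0, %%cr3" :: "r" (val) : "memory" );
}
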
Juergen
Thread overview: 23+ messages
2018-02-09 14:01 [PATCH v3 00/17] Alternative Meltdown mitigation Juergen Gross
2018-02-09 14:01 ` [PATCH v3 01/17] x86: don't use hypervisor stack size for dumping guest stacks Juergen Gross
2018-02-09 14:01 ` [PATCH v3 02/17] x86: do a revert of e871e80c38547d9faefc6604532ba3e985e65873 Juergen Gross
2018-02-13 10:14 ` Jan Beulich
2018-02-09 14:01 ` [PATCH v3 03/17] x86: revert 5784de3e2067ed73efc2fe42e62831e8ae7f46c4 Juergen Gross
2018-02-09 14:01 ` [PATCH v3 04/17] x86: don't access saved user regs via rsp in trap handlers Juergen Gross
2018-02-09 14:01 ` [PATCH v3 05/17] x86: add a xpti command line parameter Juergen Gross
2018-02-09 14:01 ` [PATCH v3 06/17] x86: allow per-domain mappings without NX bit or with specific mfn Juergen Gross
2018-02-09 14:01 ` [PATCH v3 07/17] xen/x86: split _set_tssldt_desc() into ldt and tss specific functions Juergen Gross
2018-02-09 14:01 ` [PATCH v3 08/17] x86: add support for spectre mitigation with local thunk Juergen Gross
2018-02-09 14:01 ` [PATCH v3 09/17] x86: create syscall stub for per-domain mapping Juergen Gross
2018-02-09 14:01 ` [PATCH v3 10/17] x86: allocate per-vcpu stacks for interrupt entries Juergen Gross
2018-02-09 14:01 ` [PATCH v3 11/17] x86: modify interrupt handlers to support stack switching Juergen Gross
2018-02-09 14:01 ` [PATCH v3 12/17] x86: activate per-vcpu stacks in case of xpti Juergen Gross
2018-02-09 14:01 ` [PATCH v3 13/17] x86: allocate hypervisor L4 page table for XPTI Juergen Gross
2018-02-09 14:01 ` [PATCH v3 14/17] xen: add domain pointer to fill_ro_mpt() and zap_ro_mpt() functions Juergen Gross
2018-02-09 14:01 ` [PATCH v3 15/17] x86: fill XPTI shadow pages and keep them in sync with guest L4 Juergen Gross
2018-02-09 14:01 ` [PATCH v3 16/17] x86: do page table switching when entering/leaving hypervisor Juergen Gross
2018-02-09 14:01 ` [PATCH v3 17/17] x86: hide most hypervisor mappings in XPTI shadow page tables Juergen Gross
2018-02-12 17:54 ` [PATCH v3 00/17] Alternative Meltdown mitigation Dario Faggioli
2018-02-13 11:36 ` Juergen Gross
2018-02-13 14:16 ` Jan Beulich
[not found] ` <5A83014E02000078001A7619@suse.com>
2018-02-13 14:29 ` Juergen Gross [this message]