From mboxrd@z Thu Jan  1 00:00:00 1970
From: Avi Kivity <avi.kivity@gmail.com>
Subject: Re: 2 CPU Conformance Issue in KVM/x86
Date: Mon, 09 Mar 2015 21:19:29 +0200
Message-ID: <54FDF241.8080002@gmail.com>
References: <A6F671BC-983C-4005-87E9-FCC68DEF0D30@gmail.com> <54F58471.7020906@redhat.com> <54FDD39C.9060908@gmail.com> <6073FF8F-E261-4DC3-817A-9F4A46B5C0DB@gmail.com> <54FDE50B.8040408@gmail.com> <13DCF857-5591-4499-9B0D-4165268E9CE8@gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8;
	format=flowed
Content-Transfer-Encoding: QUOTED-PRINTABLE
Cc: Paolo Bonzini <pbonzini@redhat.com>,
	kvm list <kvm@vger.kernel.org>,
	Radim Krčmář <rkrcmar@redhat.com>
To: Nadav Amit <nadav.amit@gmail.com>
Return-path: <kvm-owner@vger.kernel.org>
Received: from mail-we0-f171.google.com ([74.125.82.171]:42167 "EHLO
	mail-we0-f171.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1751340AbbCITTe (ORCPT <rfc822;kvm@vger.kernel.org>);
	Mon, 9 Mar 2015 15:19:34 -0400
Received: by wesq59 with SMTP id q59so21117313wes.9
        for <kvm@vger.kernel.org>; Mon, 09 Mar 2015 12:19:32 -0700 (PDT)
In-Reply-To: <13DCF857-5591-4499-9B0D-4165268E9CE8@gmail.com>
Sender: kvm-owner@vger.kernel.org
List-ID: <kvm.vger.kernel.org>

On 03/09/2015 09:07 PM, Nadav Amit wrote:
> Avi Kivity <avi.kivity@gmail.com> wrote:
>
>> On 03/09/2015 07:51 PM, Nadav Amit wrote:
>>> Avi Kivity <avi.kivity@gmail.com> wrote:
>>>
>>>> On 03/03/2015 11:52 AM, Paolo Bonzini wrote:
>>>>>> In this
>>>>>> case, the VM might expect exceptions when PTE bits which are higher than the
>>>>>> maximum (reported) address width are set, and it would not get such
>>>>>> exceptions. This problem can easily be experienced by small change to the
>>>>>> existing KVM unit-tests.
>>>>>>
>>>>>> There are many variants to this problem, and the only solution which I
>>>>>> consider complete is to report to the VM the maximum (52) physical address
>>>>>> width to the VM, configure the VM to exit on #PF with reserved-bit
>>>>>> error-codes, and then emulate these faulting instructions.
>>>>> Not even that would be a definitive solution.  If the guest tries to map
>>>>> RAM (e.g. a PCI BAR that is backed by RAM) above the host MAXPHYADDR,
>>>>> you would get EPT misconfiguration vmexits.
>>>>>
>>>>> I think there is no way to emulate physical address width correctly,
>>>>> except by disabling EPT.
>>>> Is the issue emulating a higher MAXPHYADDR on the guest than is available
>>>> on the host? I don't think there's any need to support that.
>>>>
>>>> Emulating a lower setting on the guest than is available on the host is, I
>>>> think, desirable. Whether it would work depends on the relative priority
>>>> of EPT misconfiguration exits vs. page table permission faults.
>>> Thanks for the feedback.
>>>
>>> Guest page-table permissions faults got priority over EPT misconfiguration.
>>> KVM can even be set to trap page-table permission faults, at least in VT-x.
>>> Anyhow, I don’t think it is enough.
>> Why is it not enough? If you trap a permission fault, you can inject any exception error code you like.
> Because there is no real permission fault. In the following example, the VM
> expects one (VM’s MAXPHYADDR=40), but there isn’t (Host’s MAXPHYADDR=46), so
> the hypervisor cannot trap it. It can only trap all #PF, which is obviously
> too intrusive.

There are three cases:

1) The guest has marked the page as not present.  In this case, reserved
bits are not reported and the guest should receive its ordinary #PF.
2) The page is present and the permissions are sufficient.  In this
case, you will get an EPT misconfiguration and can proceed to inject a
#PF with the reserved bit flag set.
3) The page is present but permissions are not sufficient.  In this case
you can trap the fault via the PFEC_MASK/PFEC_MATCH fields in the VMCS
and inject a #PF with the reserved bit flag added to the guest.

So you can emulate it and only trap permission faults.  It's still too
expensive though.
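
To make the three cases concrete, here is a minimal sketch in C.  It is
not KVM code: the enum, struct and function names are made up for
illustration, and it only looks at a single leaf PTE that already has a
bit set above the guest's MAXPHYADDR (and below the host's).  Only the
#PF error-code bit positions are architectural.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Architectural x86 #PF error-code bits. */
#define PFEC_P    (1u << 0) /* the faulting translation was present */
#define PFEC_W    (1u << 1) /* the access was a write */
#define PFEC_US   (1u << 2) /* the access came from user mode */
#define PFEC_RSVD (1u << 3) /* a reserved bit was set in a present entry */

/* Hypothetical labels for what the hypervisor observes. */
enum observed { NO_EXIT, EPT_MISCONFIG_EXIT, TRAPPED_PF_EXIT };

struct outcome {
    enum observed exit;  /* exit seen by the hypervisor, if any */
    uint32_t guest_pfec; /* error code the guest should end up seeing */
};

/*
 * Classify an access through a leaf PTE that has a guest-reserved bit set.
 * 'access' carries only the W/US bits describing the access itself.
 */
static struct outcome classify(bool present, bool perms_ok, uint32_t access)
{
    struct outcome o;

    assert((access & (PFEC_P | PFEC_RSVD)) == 0);

    if (!present) {
        /* Case 1: hardware delivers the ordinary not-present #PF. */
        o.exit = NO_EXIT;
        o.guest_pfec = access;
    } else if (perms_ok) {
        /* Case 2: EPT misconfiguration exit; inject #PF with RSVD set. */
        o.exit = EPT_MISCONFIG_EXIT;
        o.guest_pfec = access | PFEC_P | PFEC_RSVD;
    } else {
        /* Case 3: hardware alone would report access | P; trap, add RSVD. */
        o.exit = TRAPPED_PF_EXIT;
        o.guest_pfec = access | PFEC_P | PFEC_RSVD;
    }
    return o;
}
```

For the failing test quoted below (a user access to a present,
supervisor-only page), this is case 3: hardware alone reports error
code 5 (P|US), while the guest expects d (P|US|RSVD).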


>>>   Here is an example
>>>
>>> My machine has MAXPHYADDR of 46. I modified kvm-unit-tests access test to
>>> set pte.45 instead of pte.51, which from the VM point-of-view should cause
>>> the #PF error-code indicate the reserved bits are set (just as pte.51 does).
>>> Here is one error from the log:
>>>
>>> test pte.p pte.45 pde.p user: FAIL: error code 5 expected d
>>> Dump mapping: address: 123400000000
>>> ------L4: 304b007
>>> ------L3: 304c007
>>> ------L2: 304d001
>>> ------L1: 200002000001
>> This is with an ept misconfig programmed into that address, yes?
> A reserved bit in the PTE is set - from the VM point-of-view. If there
> wasn’t another cause for #PF, it would lead to EPT violation/misconfig.
>
>>> As you can see, the #PF should have had two reasons: reserved bits, and user
>>> access to supervisor only page. The error-code however does not indicate the
>>> reserved-bits are set.
>>>
>>> Note that KVM did not trap any exit on that faulting instruction, as
>>> otherwise it would try to emulate the instruction and assuming it is
>>> supported (and that the #PF was not on an instruction fetch), should be able
>>> to emulate the #PF correctly.
>>> [ The test actually crashes soon after this error due to these reasons. ]
>>>
>>> Anyhow, that is the reason for me to assume that having the maximum
>>> MAXPHYADDR is better.
>> Well, that doesn't work for the reasons Paolo noted.  The guest can have a ivshmem device attached, and map it above a host-supported virtual address, and suddenly it goes slow.
> I fully understand. That’s the reason I don’t have a reasonable solution.

I can't think of one with reasonable performance either.  Perhaps the
maintainers could raise the issue with Intel.  It looks academic but it
can happen in real life -- KVM for example used to rely on reserved bits
faults (it set all bits in the PTE so it wouldn't have been caught by this).
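
As a rough illustration of why an all-ones PTE is not affected while the
pte.45 test is, here is a small C sketch.  The helper names are
hypothetical; the only assumption is the architectural rule that
physical-address bits [51:MAXPHYADDR] of a PTE are reserved.  The L1
value is the one from the dump quoted above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Physical-address bits [51:maxphyaddr] of a PTE are reserved. */
static uint64_t rsvd_addr_bits(unsigned maxphyaddr)
{
    assert(maxphyaddr >= 32 && maxphyaddr <= 52);
    return ((1ULL << 52) - 1) & ~((1ULL << maxphyaddr) - 1);
}

/* True if the PTE sets any address bit that is reserved at this width. */
static bool has_rsvd_bits(uint64_t pte, unsigned maxphyaddr)
{
    return (pte & rsvd_addr_bits(maxphyaddr)) != 0;
}
```

With the L1 entry 0x200002000001 (bit 45 set), has_rsvd_bits() is true
for the guest's MAXPHYADDR of 40 but false for the host's 46, which is
the mismatch.  A PTE with all bits set is reserved at both widths, so
the hardware reports it regardless.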