From mboxrd@z Thu Jan  1 00:00:00 1970
From: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Subject: Re: [PATCH V4 3/7] KVM, pkeys: update memeory permission bitmask for
 pkeys
Date: Tue, 8 Mar 2016 17:19:26 +0800
Message-ID: <56DE991E.8070007@linux.intel.com>
References: <1457177252-7577-1-git-send-email-huaitong.han@intel.com>
 <1457177252-7577-4-git-send-email-huaitong.han@intel.com>
 <56DBDF57.4030607@linux.intel.com> <56DCB9BF.4020904@redhat.com>
 <56DE80B8.40900@linux.intel.com> <56DE8D7D.5010302@redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Cc: kvm@vger.kernel.org
To: Paolo Bonzini <pbonzini@redhat.com>,
	Huaitong Han <huaitong.han@intel.com>, gleb@kernel.org
Return-path: <kvm-owner@vger.kernel.org>
Received: from mga02.intel.com ([134.134.136.20]:60182 "EHLO mga02.intel.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S932360AbcCHJTu (ORCPT <rfc822;kvm@vger.kernel.org>);
	Tue, 8 Mar 2016 04:19:50 -0500
In-Reply-To: <56DE8D7D.5010302@redhat.com>
Sender: kvm-owner@vger.kernel.org
List-ID: <kvm.vger.kernel.org>


On 03/08/2016 04:29 PM, Paolo Bonzini wrote:
>
>
> On 08/03/2016 08:35, Xiao Guangrong wrote:
>>> well-predicted branches are _faster_ than branchless code.
>>
>> Er, I do not understand this. If these two cases have the same cache hits,
>> how can a branch be faster?
>
> Because branchy code typically executes fewer instructions.
>
> Take the same example here:
>
>>>      do {
>>>      } while (level > PT_PAGE_TABLE_LEVEL &&
>>>           (!(gpte & PT_PAGE_SIZE_MASK) ||
>>>            level == mmu->root_level));
>
> The assembly looks like (assuming %level, %gpte and %mmu are registers)
>
> 	cmp $1, %level
> 	jbe 1f
> 	test $128, %gpte
> 	jz beginning_of_loop
> 	cmpb ROOT_LEVEL_OFFSET(%mmu), %level
> 	je beginning_of_loop
> 1:
>
> These are two to six instructions, with no dependencies, which the
> processor can fuse into one to three macro-ops.  For the branchless
> code (I posted a patch to implement this algorithm yesterday):
>
> 	lea -2(%level), %temp1
> 	orl %temp1, %gpte
> 	movzbl LAST_NONLEAF_LEVEL_OFFSET(%mmu), %temp1
> 	movl %level, %temp2
> 	subl %temp1, %temp2
> 	andl %temp2, %gpte
> 	test $128, %gpte
> 	jz beginning_of_loop
>
> These are eight instructions, with some dependencies between them too.
> In some cases branchless code throws away the result of 10-15
> instructions (because in the end it's ANDed with 0, for example).  If it
> weren't for mispredictions, the branchy code would be faster.
>

Good lesson, thank you, Paolo. :)
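
For the archive, here is how I read the two variants in C. This is only a
sketch under my own assumptions: the function names are made up, the levels
are passed in directly instead of being read from mmu, and I take
last_nonleaf_level to be equal to root_level so that the two forms can be
compared. It is not the code from your patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PT_PAGE_TABLE_LEVEL 1u
#define PT_PAGE_SIZE_MASK   (1u << 7)	/* the $128 in the assembly above */

/*
 * Branchy form: the negation of the loop condition quoted above.
 * root_level is a plain parameter here (assumption, to stay standalone).
 */
static bool is_last_gpte_branchy(unsigned root_level, unsigned level,
				 uint32_t gpte)
{
	if (level <= PT_PAGE_TABLE_LEVEL)
		return true;
	if (!(gpte & PT_PAGE_SIZE_MASK))
		return false;
	return level != root_level;
}

/*
 * Branchless form, following the second assembly listing.
 * Assumes 1 <= level <= last_nonleaf_level and last_nonleaf_level >= 2.
 */
static bool is_last_gpte_branchless(unsigned last_nonleaf_level,
				    unsigned level, uint32_t gpte)
{
	assert(level >= 1 && level <= last_nonleaf_level);

	/* level - 2 wraps to all ones iff level == 1, forcing bit 7 on. */
	gpte |= level - PT_PAGE_TABLE_LEVEL - 1;

	/* Bit 7 of the difference is set iff level < last_nonleaf_level. */
	gpte &= level - last_nonleaf_level;

	return (gpte & PT_PAGE_SIZE_MASK) != 0;
}
```

Under those assumptions both functions give the same answer for every level
from 1 to root_level; the second one just always pays for the full
computation, which is the cost you describe.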

>>> Here none of the branches is easily predicted, so we want to get rid of
>>> them.
>>>
>>> The next patch adds three branches, and they are not all equal:
>>>
>>> - is_long_vcpu is well predicted to true (or even for 32-bit OSes it
>>> should be well predicted if the host is not overcommitted).
>>
>> But in production, CPU overcommit is the normal case...
>
> It depends on the workload.  I would guess that 32-bit OSes are more
> common where you have a single legacy guest because e.g. it doesn't have
> drivers for recent hardware.
>
>>>> However, I do not think we need a new byte index for PK. The conditions
>>>> detecting PK enablement can be fully found in the current vcpu context
>>>> (i.e., CR4, EFER and U/S access).
>>>
>>> Adding a new byte index lets you cache CR4.PKE (and actually EFER.LMA
>>> too, though Huaitong's patch doesn't do that).  It's a good thing to do.
>>>    U/S is also handled by adding a new byte index, see Huaitong's
>>
>> That is not the same thing: U/S here is the type of memory access, which
>> depends on the vCPU's runtime state.
>
> Do you mean the type of page (ACC_USER_MASK)?  Only U=1 pages are
> subject to PKRU, even in the kernel.  The processor CPL
> (PFERR_USER_MASK) only matters if CR0.WP=0.

No. The index is:
| Byte index: page fault error code [4:1]

So the type I mentioned is the type of memory access issued by the CPU, e.g.
the CPU is writing to the memory or executing from it.
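
To make the indexing concrete, here is a minimal sketch. The PFERR_* bit
positions are the architectural error code bits; the helper names, the
16-entry table and the 0..7 range for pte_access are my assumptions for
illustration, not the exact KVM code.

```c
#include <assert.h>
#include <stdint.h>

/* Architectural page fault error code bits. */
#define PFERR_PRESENT_MASK (1u << 0)
#define PFERR_WRITE_MASK   (1u << 1)
#define PFERR_USER_MASK    (1u << 2)
#define PFERR_RSVD_MASK    (1u << 3)
#define PFERR_FETCH_MASK   (1u << 4)

/* Byte index: error code bits [4:1], i.e. the kind of access the CPU made. */
static unsigned pfec_byte_index(uint32_t pfec)
{
	return (pfec >> 1) & 0xf;
}

/*
 * Bit index: the access rights collected from the page tables
 * (hypothetical encoding, a value in 0..7).
 * Returns 1 if the precomputed table says this combination faults.
 */
static int permission_lookup(const uint8_t permissions[16], uint32_t pfec,
			     unsigned pte_access)
{
	assert(pte_access < 8);
	return (permissions[pfec_byte_index(pfec)] >> pte_access) & 1;
}
```

The byte is selected only by what the CPU is doing (write, user, fetch), and
the bit inside it by what the page allows, so nothing about the vCPU mode has
to be tested at lookup time.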

>
>> But whether PKEY is enabled or not depends entirely on the
>> environment of the CPU, and we should _always_ check PKEY even if
>> PFEC.PKEY is not set.
>>
>> As PKEY is not enabled on softmmu, gva_to_gpa requests mostly come from
>> inside KVM, which means we should always set PFEC.PKEY for all
>> gva_to_gpa requests. Wasting a bit is really unnecessary.
>>
>> And it is always better to move more workload from permission_fault() to
>> update_permission_bitmask() as the former is much hotter than the latter.
>
> I agree, but I'm not sure why you say that adding a bit adds more work
> to permission_fault().

A branch to check PFEC.PKEY, which is not well predicted on soft mmu. (With
EPT it should always be set, as the page table walk is done by software; so
if we only consider EPT we can assume it is always true.)

>
> Adding a bit lets us skip CR4.PKE and EFER.LMA checks in
> permission_fault() and in all gva_to_gpa() callers.

The question is when we could clear this bit to skip these checks. We should
_always_ check PKEY even if PFEC.PKEY = 0, because:
1) all gva_to_gpa()s issued by KVM should always check PKEY. This is the
    EPT-only case.

2) if the feature is enabled on softmmu, the shadow page table may change
    the behavior; for example, an MMIO access causes a reserved-bit #PF,
    which may clear PFEC.PKEY.

And skipping these checks is not really necessary as we can take them into
account when we update the bitmask.

>
> So my proposal is to compute the "effective" PKRU bits (i.e. extract the
> relevant AD and WD bits, and mask away WD if irrelevant) in
> update_permission_bitmask(), and add PFERR_PK_MASK to the error code if
> they are nonzero.
>
> PFERR_PK_MASK must be computed in permission_fault().  It's a runtime
> condition that is not known beforehand.
>

Yes, you are right.
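
Just to check that I understand the proposal, a rough sketch of the
"effective" PKRU bits, written with plain booleans rather than a precomputed
mask. The struct, the helper names and the parameter layout are hypothetical;
only the AD/WD layout of PKRU (two bits per key) and the PK bit in the error
code are architectural.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PFERR_PK_MASK (1u << 5)	/* PK bit of the page fault error code */
#define PKRU_AD_BIT   1u	/* access disable */
#define PKRU_WD_BIT   2u	/* write disable */

/* Hypothetical description of one access, for illustration only. */
struct pk_access {
	bool pk_enabled;	/* CR4.PKE && EFER.LMA */
	bool page_user;		/* U=1 in the translation */
	bool write;
	bool fetch;
	bool user_mode;		/* access made at CPL 3 */
	bool cr0_wp;
};

/* Extract AD/WD for the key and mask away WD where it is irrelevant. */
static uint32_t effective_pkru_bits(uint32_t pkru, unsigned pkey,
				    const struct pk_access *a)
{
	uint32_t bits;

	assert(pkey < 16);
	bits = (pkru >> (pkey * 2)) & (PKRU_AD_BIT | PKRU_WD_BIT);

	/* Keys only apply to data accesses to U=1 pages with PK enabled. */
	if (!a->pk_enabled || !a->page_user || a->fetch)
		return 0;

	/* WD only matters for writes; supervisor writes ignore it if WP=0. */
	if (!a->write || (!a->user_mode && !a->cr0_wp))
		bits &= ~PKRU_WD_BIT;

	return bits;
}

/* Add PFERR_PK_MASK to the error code if any effective bit is set. */
static uint32_t add_pk_error(uint32_t pfec, uint32_t pkru, unsigned pkey,
			     const struct pk_access *a)
{
	if (effective_pkru_bits(pkru, pkey, a))
		pfec |= PFERR_PK_MASK;
	return pfec;
}
```

The inputs that depend only on the vCPU mode (pk_enabled, cr0_wp) are the
ones that could be folded into update_permission_bitmask(), while the PKRU
value and the key of the page are only known at permission_fault() time.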