From mboxrd@z Thu Jan  1 00:00:00 1970
From: Paolo Bonzini <pbonzini@redhat.com>
Subject: Re: [PATCH V4 3/7] KVM, pkeys: update memeory permission bitmask for
 pkeys
Date: Tue, 8 Mar 2016 09:29:49 +0100
Message-ID: <56DE8D7D.5010302@redhat.com>
References: <1457177252-7577-1-git-send-email-huaitong.han@intel.com>
 <1457177252-7577-4-git-send-email-huaitong.han@intel.com>
 <56DBDF57.4030607@linux.intel.com> <56DCB9BF.4020904@redhat.com>
 <56DE80B8.40900@linux.intel.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Cc: kvm@vger.kernel.org
To: Xiao Guangrong <guangrong.xiao@linux.intel.com>,
	Huaitong Han <huaitong.han@intel.com>, gleb@kernel.org
Return-path: <kvm-owner@vger.kernel.org>
Received: from mail-wm0-f66.google.com ([74.125.82.66]:32861 "EHLO
	mail-wm0-f66.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S932157AbcCHI34 (ORCPT <rfc822;kvm@vger.kernel.org>);
	Tue, 8 Mar 2016 03:29:56 -0500
Received: by mail-wm0-f66.google.com with SMTP id n186so2683590wmn.0
        for <kvm@vger.kernel.org>; Tue, 08 Mar 2016 00:29:55 -0800 (PST)
In-Reply-To: <56DE80B8.40900@linux.intel.com>
Sender: kvm-owner@vger.kernel.org
List-ID: <kvm.vger.kernel.org>


On 08/03/2016 08:35, Xiao Guangrong wrote:
>> well-predicted branches are _faster_ than branchless code.
> 
> Er, i do not understand this. If these two case have the same cache hit,
> how can a branch be faster?

Because branchless code typically executes fewer instructions.

Take the same example here:

>>     do {
>>     } while (level > PT_PAGE_TABLE_LEVEL &&
>>          (!(gpte & PT_PAGE_SIZE_MASK) ||
>>           level == mmu->root_level));

The assembly looks like (assuming %level, %gpte and %mmu are registers)

	cmp $1, %level
	jbe 1f
	test $128, %gpte
	jz beginning_of_loop
	cmpb ROOT_LEVEL_OFFSET(%mmu), %level
	je beginning_of_loop
1:

These are two to six instructions, with no dependency and which the
processor can change into one to three macro-ops.  For the branchless
code (I posted a patch to implement this algorithm yesterday):

	lea -2(%level), %temp1
	orl %temp1, %gpte
	movzbl LAST_NONLEAF_LEVEL_OFFSET(%mmu), %temp1
	movl %level, %temp2
	subl %temp1, %temp2
	andl %temp2, %gpte
	test $128, %gpte
	jz beginning_of_loop

These are eight instructions, with some dependencies between them too.
In some cases branchless code throws away the result of 10-15
instructions (because in the end it's ANDed with 0, for example).  If it
weren't for mispredictions, the branchy code would be faster.

>> Here none of the branches is easily predicted, so we want to get rid of
>> them.
>>
>> The next patch adds three branches, and they are not all equal:
>>
>> - is_long_vcpu is well predicted to true (or even for 32-bit OSes it
>> should be well predicted if the host is not overcommitted).
> 
> But, in the production, cpu over-commit is the normal case...

It depends on the workload.  I would guess that 32-bit OSes are more
common where you have a single legacy guest because e.g. it doesn't have
drivers for recent hardware.

>>> However, i do not think we need a new byte index for PK. The conditions
>>> detecting PK enablement
>>> can be fully found in current vcpu content (i.e, CR4, EFER and U/S
>>> access).
>>
>> Adding a new byte index lets you cache CR4.PKE (and actually EFER.LMA
>> too, though Huaitong's patch doesn't do that).  It's a good thing to do.
>>   U/S is also handled by adding a new byte index, see Huaitong's
> 
> It is not on the same page, the U/S is the type of memory access which
> is depended on vCPU runtime.

Do you mean the type of page (ACC_USER_MASK)?  Only U=1 pages are
subject to PKRU, even in the kernel.  The processor CPL
(PFERR_USER_MASK) only matters if CR0.WP=0.

> But the condition whether PKEY is enabled or not
> is fully depended on the envorment of CPU and we should _always_
> check PKEY even if PFEC_PKEY is not set.
> 
> As PKEY is not enabled on softmmu, the gva_to_gpa mostly comes from internal
> KVM, that means we should always set PFEC.PKEY for all the gva_to_gpa request.
> Wasting a bit is really unnecessary.
> 
> And it is always better to move more workload from permission_fault() to
> update_permission_bitmask() as the former is much hotter than the latter.

I agree, but I'm not sure why you say that adding a bits adds more work
to permission_fault().

Adding a bit lets us skip CR4.PKU and EFER.LMA checks in
permission_fault() and in all gva_to_gpa() callers.

So my proposal is to compute the "effective" PKRU bits (i.e. extract the
relevant AD and WD bits, and mask away WD if irrelevant) in
update_permission_bitmask(), and add PFERR_PK_MASK to the error code if
they are nonzero.

PFERR_PK_MASK must be computed in permission_fault().  It's a runtime
condition that it's not known before.

>> I don't like the idea of making permissions[] four times larger.
> 
> Okay, then lets introduce a new field for PKEY separately. Your approach
> , fault_u1w0, looks good to me.
>
>> I think I even prefer if update_permission_bitmask sets up a separate
>> bitmask:
>>
>>         mmu->fault_u1w0 |= (wf && !w) << byte;
>>
>> and then this other bitmap can be tested in permission_fault:
>>
>> -        if (!wf || (!uf && !is_write_protection(vcpu)))
>> -            pkru_bits &= ~(1 << PKRU_WRITE);
>> +        /*
>> +         * fault_u1w0 ignores SMAP and PKRU, so use the
>> +         * partially-computed PFEC that we were given.
>> +         */
>> +        fault_uw = (mmu->fault_u1w0 >> (pfec >> 1)) & 1;
>> +        pkru_bits &= ~(1 << PKRU_WRITE) |
>> +            (fault_uw << PKRU_WRITE);
> 
> It looks good to me!

Good. Thanks for reviewing the idea!

Paolo