From mboxrd@z Thu Jan  1 00:00:00 1970
From: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Subject: Re: [PATCH V4 3/7] KVM, pkeys: update memeory permission bitmask for
 pkeys
Date: Tue, 8 Mar 2016 15:35:20 +0800
Message-ID: <56DE80B8.40900@linux.intel.com>
References: <1457177252-7577-1-git-send-email-huaitong.han@intel.com>
 <1457177252-7577-4-git-send-email-huaitong.han@intel.com>
 <56DBDF57.4030607@linux.intel.com> <56DCB9BF.4020904@redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Cc: kvm@vger.kernel.org
To: Paolo Bonzini <pbonzini@redhat.com>,
	Huaitong Han <huaitong.han@intel.com>, gleb@kernel.org
Return-path: <kvm-owner@vger.kernel.org>
Received: from mga04.intel.com ([192.55.52.120]:38246 "EHLO mga04.intel.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1753483AbcCHHfo (ORCPT <rfc822;kvm@vger.kernel.org>);
	Tue, 8 Mar 2016 02:35:44 -0500
In-Reply-To: <56DCB9BF.4020904@redhat.com>
Sender: kvm-owner@vger.kernel.org
List-ID: <kvm.vger.kernel.org>


On 03/07/2016 07:14 AM, Paolo Bonzini wrote:
>
>
> On 06/03/2016 08:42, Xiao Guangrong wrote:
>>>
>>> +        rsvdf = pfec & PFERR_RSVD_MASK;
>>
>> No. RSVD is reserved by SMAP and it should not be used to walk guest
>> page page table.
>
> Agreed.  You can treat your code as if rsvdf was always false.  Reserved
> bits are handled elsewhere.
>
>>> +        pkuf = pfec & PFERR_PK_MASK;
>>>            /*
>>>             * PFERR_RSVD_MASK bit is set in PFEC if the access is not
>>>             * subject to SMAP restrictions, and cleared otherwise. The
>>> @@ -3824,12 +3830,34 @@ static void update_permission_bitmask(struct
>>> kvm_vcpu *vcpu,
>>>                     *   clearer.
>>>                     */
>>>                    smap = cr4_smap && u && !uf && !ff;
>>> +
>>> +                /*
>>> +                * PKU:additional mechanism by which the paging
>>> +                * controls access to user-mode addresses based
>>> +                * on the value in the PKRU register. A fault is
>>> +                * considered as a PKU violation if all of the
>>> +                * following conditions are true:
>>> +                * 1.CR4_PKE=1.
>>> +                * 2.EFER_LMA=1.
>>> +                * 3.page is present with no reserved bit
>>> +                *   violations.
>>> +                * 4.the access is not an instruction fetch.
>>> +                * 5.the access is to a user page.
>>> +                * 6.PKRU.AD=1
>>> +                *    or The access is a data write and
>>> +                *       PKRU.WD=1 and either CR0.WP=1
>>> +                *       or it is a user access.
>>> +                *
>>> +                * The 2nd and 6th conditions are computed
>>> +                * dynamically in permission_fault.
>>> +                */
>>
>> It is not good as there are branches in the next patch.
>
> It's important to note that branches in general are _not_ a problem.
> Only badly-predicted branches are a problem;

I agree on this point.

> well-predicted branches are
> _faster_ than branchless code.

Er, I do not understand this. If the two cases hit the cache equally often,
how can the branch be faster?

> For example, take is_last_gpte.  The
> branchy way to write it in walk_addr_generic would be (excluding the
> 32-bit !PSE case) something like:
>
> 	do {
> 	} while (level > PT_PAGE_TABLE_LEVEL &&
> 		 (!(gpte & PT_PAGE_SIZE_MASK) ||
> 		  level == mmu->root_level));
>
> Here none of the branches is easily predicted, so we want to get rid of
> them.
>
> The next patch adds three branches, and they are not all equal:
>
> - is_long_vcpu is well predicted to true (or even for 32-bit OSes it
> should be well predicted if the host is not overcommitted).
>

But in production, CPU overcommit is the normal case...

> - pkru != 0 should be well-predicted to false, at least for a few
> years... and perhaps even later considering that most MMIO access
> happens in the kernel.
>
> - !wf || (!uf && !is_write_protection(vcpu)) is badly predicted and
> should be removed
>
> So only the last one is a problem.
>
>> However, i do not think we need a new byte index for PK. The conditions
>> detecting PK enablement
>> can be fully found in current vcpu content (i.e, CR4, EFER and U/S access).
>
> Adding a new byte index lets you cache CR4.PKE (and actually EFER.LMA
> too, though Huaitong's patch doesn't do that).  It's a good thing to do.
>   U/S is also handled by adding a new byte index, see Huaitong's

That is not the same thing. U/S is a property of the memory access, which
depends on what the vCPU is doing at runtime. Whether PKEY is enabled,
however, depends entirely on the CPU environment, and we should _always_
check PKEY even if PFEC_PKEY is not set.

As PKEY is not enabled on softmmu, gva_to_gpa requests mostly come from
inside KVM, which means we would have to set PFEC.PKEY for every gva_to_gpa
request. Wasting a bit on that is really unnecessary.

And it is always better to move more workload from permission_fault() to
update_permission_bitmask() as the former is much hotter than the latter.

>
> 	pku = cr4_pku && !ff && u;
>
> If this is improved to
>
> 	pku = cr4_pku && long_mode_vcpu && !ff && u;
>
> one branch goes away in permission_fault.  The read_pkru() branch, if
> well predicted, lets you optimize away the pkru tests.  I think it
> _would_ be well predicted, so I think it should remain.
>
> The "(!wf || (!uf && !is_write_protection(vcpu)))" is indeed the worst
> of the three.  I was lenient in my previous review because this code
> won't run on any system being sold now and in the next 1-2 (?) years.
> However, we can indeed get rid of the branch, so let's do it. :)
>
> I don't like the idea of making permissions[] four times larger.

Okay, then let's introduce a separate new field for PKEY. Your approach,
fault_u1w0, looks good to me.

> Instead, we can find the value of the expression elsewhere in
> mmu->permissions (!), or cache it separately in a different field of mmu.
>
> If I interpret the rules correctly, WD works like this.  First, we take
> the PTE and imagine that it had W=0.  Then, if this access would not
> fault, WD is ignored.  This is because:
>
> - on reads, WD is always ignored
>
> - on writes, WD is ignored in supervisor mode if !CR0.WP
>
> ... and this is how W=0 page work, isn't it?
>

Yes, it is.
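
To spell that out, here is a minimal standalone sketch (not KVM code; the
helper names wd_applies, u1w0_page_faults and check_equivalence are made up
for illustration). It compares the WD rule with the fault rule for a U=1,
W=0 page over all eight combinations of write fault, user fault and CR0.WP:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper: does PKRU.WD apply to this access? */
static bool wd_applies(bool wf, bool uf, bool cr0_wp)
{
	/* Reads always ignore WD. */
	if (!wf)
		return false;
	/* Supervisor writes ignore WD when CR0.WP=0. */
	if (!uf && !cr0_wp)
		return false;
	return true;
}

/* Hypothetical helper: would the same access fault on a U=1, W=0 page? */
static bool u1w0_page_faults(bool wf, bool uf, bool cr0_wp)
{
	/* A W=0 page is only writable by the supervisor with CR0.WP=0. */
	bool w = !uf && !cr0_wp;

	return wf && !w;
}

/* Walk all 8 input combinations and assert that both predicates agree. */
static void check_equivalence(void)
{
	for (int i = 0; i < 8; i++) {
		bool wf = i & 1, uf = i & 2, wp = i & 4;

		assert(wd_applies(wf, uf, wp) == u1w0_page_faults(wf, uf, wp));
	}
}
```

Both predicates reduce to wf && (uf || cr0_wp), so check_equivalence()
completes without tripping an assertion.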

> If so, I think it's something like this in code:
>
> -		if (!wf || (!uf && !is_write_protection(vcpu)))
> -			pkru_bits &= ~(1 << PKRU_WRITE);
> +		/* Only testing writes, so ignore SMAP and fetch.  */
> +		pfec_uw = pfec & (PFERR_WRITE_MASK|PFERR_USER_MASK);
> +		fault_uw = mmu->permissions[pfec_uw >> 1];
> +		/*
> +		 * This page has U=1, so check if a U=1 W=0 page faults
> +		 * on this access; if not ignore WD.
> +		 */
> +		pkru_bits &= ~(1 << PKRU_WRITE) |
> +			(fault_uw >> (ACC_USER_MASK - PKRU_WRITE));
>

This is tricky, but I finally understand it. Yeah, it works. :) Except that
I do not think PFEC.PKEY should be used in the index, as I explained above.
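
For anyone else who stumbles over the shift, here is a standalone sketch of
just that arithmetic. It assumes ACC_USER_MASK is 4 (it equals PT_USER_MASK,
1 << 2, in KVM) and that PKRU_WRITE is bit index 1, i.e. WD within a key's
two-bit AD/WD field; mask_wd is a made-up name, not something in the patch:

```c
#include <assert.h>
#include <stdint.h>

#define ACC_USER_MASK	4u	/* assumed: pte_access value for U=1, W=0, X=0 */
#define PKRU_WRITE	1	/* assumed: bit index of WD in the key's field */

/*
 * fault_uw is one byte of a permissions[]-style table: bit N is set when
 * a page with pte_access == N would fault. Bit ACC_USER_MASK therefore
 * describes the U=1, W=0 page; shifting right by
 * (ACC_USER_MASK - PKRU_WRITE) lines it up with the WD bit.
 */
static uint32_t mask_wd(uint32_t pkru_bits, uint8_t fault_uw)
{
	/* Every bit except WD is forced to 1 in the mask, so only WD can be cleared. */
	return pkru_bits & (~(1u << PKRU_WRITE) |
			    ((uint32_t)fault_uw >> (ACC_USER_MASK - PKRU_WRITE)));
}
```

With pkru_bits = 3 (AD and WD both set), mask_wd(3, 0x10) keeps WD because
the U=1, W=0 page would fault, while mask_wd(3, 0xEF) clears WD even though
every other bit of the byte is set.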

> I think I even prefer if update_permission_bitmask sets up a separate
> bitmask:
>
> 		mmu->fault_u1w0 |= (wf && !w) << byte;
>
> and then this other bitmap can be tested in permission_fault:
>
>
> -		if (!wf || (!uf && !is_write_protection(vcpu)))
> -			pkru_bits &= ~(1 << PKRU_WRITE);
> +		/*
> +		 * fault_u1w0 ignores SMAP and PKRU, so use the
> +		 * partially-computed PFEC that we were given.
> +		 */
> +		fault_uw = (mmu->fault_u1w0 >> (pfec >> 1)) & 1;
> +		pkru_bits &= ~(1 << PKRU_WRITE) |
> +			(fault_uw << PKRU_WRITE);
>

It looks good to me!
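
To convince myself, I wrote a small standalone model of it. This is only a
sketch under my own assumptions, not the real KVM code: struct toy_mmu and
the toy_* functions are made-up names, PKRU_WRITE is assumed to be bit
index 1, and the bitmap only models the write/user part of the fault rule
(no SMAP, SMEP or NX). The bitmap is filled in once on the cold path and
the hot path only shifts and masks:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PFERR_WRITE_MASK	(1u << 1)
#define PFERR_USER_MASK		(1u << 2)
#define PKRU_WRITE		1	/* assumed bit index of WD */

/* Hypothetical stand-in for the new mmu field. */
struct toy_mmu {
	uint16_t fault_u1w0;	/* bit (pfec >> 1): would a U=1, W=0 page fault? */
};

/* Cold path: recompute the bitmap whenever CR0.WP changes. */
static void toy_update_bitmask(struct toy_mmu *mmu, bool cr0_wp)
{
	mmu->fault_u1w0 = 0;
	for (unsigned int byte = 0; byte < 16; byte++) {
		unsigned int pfec = byte << 1;
		bool wf = pfec & PFERR_WRITE_MASK;
		bool uf = pfec & PFERR_USER_MASK;
		/* W=0 page: writable only by the supervisor with CR0.WP=0. */
		bool w = !uf && !cr0_wp;

		mmu->fault_u1w0 |= (uint16_t)((wf && !w) << byte);
	}
}

/* Hot path: branchless, clears WD unless the U=1, W=0 page would fault. */
static uint32_t toy_pkru_bits(const struct toy_mmu *mmu, unsigned int pfec,
			      uint32_t pkru_bits)
{
	uint32_t fault_uw;

	assert((pfec >> 1) < 16);
	fault_uw = (mmu->fault_u1w0 >> (pfec >> 1)) & 1;
	return pkru_bits & (~(1u << PKRU_WRITE) | (fault_uw << PKRU_WRITE));
}

/* Reference: the branchy version that the lookup replaces. */
static uint32_t toy_pkru_bits_branchy(unsigned int pfec, bool cr0_wp,
				      uint32_t pkru_bits)
{
	bool wf = pfec & PFERR_WRITE_MASK;
	bool uf = pfec & PFERR_USER_MASK;

	if (!wf || (!uf && !cr0_wp))
		pkru_bits &= ~(1u << PKRU_WRITE);
	return pkru_bits;
}
```

In this model the two versions return the same pkru_bits for every pfec
value from 0 to 31, with CR0.WP both set and clear.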