From: Paolo Bonzini <pbonzini@redhat.com>
To: Xiao Guangrong <guangrong.xiao@linux.intel.com>,
	Huaitong Han <huaitong.han@intel.com>,
	gleb@kernel.org
Cc: kvm@vger.kernel.org
Subject: Re: [PATCH V4 3/7] KVM, pkeys: update memeory permission bitmask for pkeys
Date: Mon, 7 Mar 2016 00:14:07 +0100
Message-ID: <56DCB9BF.4020904@redhat.com>
In-Reply-To: <56DBDF57.4030607@linux.intel.com>



On 06/03/2016 08:42, Xiao Guangrong wrote:
>>
>> +        rsvdf = pfec & PFERR_RSVD_MASK;
> 
> No. RSVD is reserved by SMAP and it should not be used to walk the
> guest page table.

Agreed.  You can treat your code as if rsvdf was always false.  Reserved
bits are handled elsewhere.

>> +        pkuf = pfec & PFERR_PK_MASK;
>>           /*
>>            * PFERR_RSVD_MASK bit is set in PFEC if the access is not
>>            * subject to SMAP restrictions, and cleared otherwise. The
>> @@ -3824,12 +3830,34 @@ static void update_permission_bitmask(struct kvm_vcpu *vcpu,
>>                    *   clearer.
>>                    */
>>                   smap = cr4_smap && u && !uf && !ff;
>> +
>> +                /*
>> +                * PKU:additional mechanism by which the paging
>> +                * controls access to user-mode addresses based
>> +                * on the value in the PKRU register. A fault is
>> +                * considered as a PKU violation if all of the
>> +                * following conditions are true:
>> +                * 1.CR4_PKE=1.
>> +                * 2.EFER_LMA=1.
>> +                * 3.page is present with no reserved bit
>> +                *   violations.
>> +                * 4.the access is not an instruction fetch.
>> +                * 5.the access is to a user page.
>> +                * 6.PKRU.AD=1
>> +                *    or The access is a data write and
>> +                *       PKRU.WD=1 and either CR0.WP=1
>> +                *       or it is a user access.
>> +                *
>> +                * The 2nd and 6th conditions are computed
>> +                * dynamically in permission_fault.
>> +                */
> 
> It is not good as there are branches in the next patch.

It's important to note that branches in general are _not_ a problem.
Only badly-predicted branches are a problem; well-predicted branches are
_faster_ than branchless code.  For example, take is_last_gpte.  The
branchy way to write it in walk_addr_generic would be (excluding the
32-bit !PSE case) something like:

	do {
	} while (level > PT_PAGE_TABLE_LEVEL &&
		 (!(gpte & PT_PAGE_SIZE_MASK) ||
		  level == mmu->root_level));

Here none of the branches is easily predicted, so we want to get rid of
them.
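
One way to get rid of such branches is to fold the condition into a small
precomputed bitmap and index it with (level, PS bit).  The following is
only a minimal sketch of that idea, not necessarily the kernel's exact
code: build_last_pte_bitmap, is_last_branchy and is_last_lookup are
illustrative names, and the constants are assumed to match the usual x86
definitions (PS is bit 7, at most four paging levels).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed values, following the usual x86 paging definitions. */
#define PT_PAGE_TABLE_LEVEL 1
#define PT_PAGE_SIZE_SHIFT  7
#define PT_PAGE_SIZE_MASK   (1u << PT_PAGE_SIZE_SHIFT)

/* Branchy reference: negation of the loop condition quoted above. */
static bool is_last_branchy(unsigned level, unsigned gpte, unsigned root_level)
{
	return !(level > PT_PAGE_TABLE_LEVEL &&
		 (!(gpte & PT_PAGE_SIZE_MASK) || level == root_level));
}

/*
 * Precompute one bit per (level, PS) pair.  Bits 0-1 of the index hold
 * level - 1, bit 2 holds the PS bit of the gpte.
 */
static uint8_t build_last_pte_bitmap(unsigned root_level)
{
	uint8_t map = 0;

	for (unsigned level = 1; level <= 4; level++)
		for (unsigned ps = 0; ps < 2; ps++)
			if (is_last_branchy(level, ps << PT_PAGE_SIZE_SHIFT,
					    root_level))
				map |= 1u << ((level - 1) | (ps << 2));
	return map;
}

/* Branchless lookup: shifts and masks only, no data-dependent jumps. */
static bool is_last_lookup(uint8_t map, unsigned level, unsigned gpte)
{
	unsigned index = level - 1;

	index |= (gpte & PT_PAGE_SIZE_MASK) >> (PT_PAGE_SIZE_SHIFT - 2);
	return map & (1u << index);
}

/* Self-check: the lookup agrees with the branchy form for every input. */
static void check_lookup(unsigned root_level)
{
	uint8_t map = build_last_pte_bitmap(root_level);

	for (unsigned level = 1; level <= 4; level++)
		for (unsigned ps = 0; ps < 2; ps++) {
			unsigned gpte = ps << PT_PAGE_SIZE_SHIFT;

			assert(is_last_lookup(map, level, gpte) ==
			       is_last_branchy(level, gpte, root_level));
		}
}
```

The bitmap only has to be rebuilt when the paging mode (root_level)
changes, so the per-walk cost is a shift, an OR and a mask.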

The next patch adds three branches, and they are not all equal:

- is_long_vcpu is well predicted to true (or even for 32-bit OSes it
should be well predicted if the host is not overcommitted).

- pkru != 0 should be well-predicted to false, at least for a few
years... and perhaps even later considering that most MMIO access
happens in the kernel.

- !wf || (!uf && !is_write_protection(vcpu)) is badly predicted and
should be removed.

So only the last one is a problem.

> However, I do not think we need a new byte index for PK. The conditions
> detecting PK enablement can be fully found in current vcpu content
> (i.e., CR4, EFER and U/S access).

Adding a new byte index lets you cache CR4.PKE (and actually EFER.LMA
too, though Huaitong's patch doesn't do that).  It's a good thing to do.
U/S is also handled by adding a new byte index; see Huaitong's

	pku = cr4_pku && !ff && u;

If this is improved to

	pku = cr4_pku && long_mode_vcpu && !ff && u;

one branch goes away in permission_fault.  The read_pkru() branch, if
well predicted, lets you optimize away the pkru tests.  I think it
_would_ be well predicted, so I think it should remain.

The "(!wf || (!uf && !is_write_protection(vcpu)))" is indeed the worst
of the three.  I was lenient in my previous review because this code
won't run on any system being sold now and in the next 1-2 (?) years.
However, we can indeed get rid of the branch, so let's do it. :)

I don't like the idea of making permissions[] four times larger.
Instead, we can find the value of the expression elsewhere in
mmu->permissions (!), or cache it separately in a different field of mmu.

If I interpret the rules correctly, WD works like this.  First, we take
the PTE and imagine that it had W=0.  Then, if this access would not
fault, WD is ignored.  This is because:

- on reads, WD is always ignored

- on writes, WD is ignored in supervisor mode if !CR0.WP

... and this is how W=0 pages work, isn't it?

If so, I think it's something like this in code:

-		if (!wf || (!uf && !is_write_protection(vcpu)))
-			pkru_bits &= ~(1 << PKRU_WRITE);
+		/* Only testing writes, so ignore SMAP and fetch.  */
+		pfec_uw = pfec & (PFERR_WRITE_MASK|PFERR_USER_MASK);
+		fault_uw = mmu->permissions[pfec_uw >> 1];
+		/*
+		 * This page has U=1, so check if a U=1 W=0 page faults
+		 * on this access; if not ignore WD.
+		 */
+		pkru_bits &= ~(1 << PKRU_WRITE) |
+			(fault_uw >> (ACC_USER_MASK - PKRU_WRITE));
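
The equivalence this relies on (WD is ignored exactly when a U=1 W=0 page
would not fault on the access) can be checked exhaustively over the three
inputs.  This is a standalone sketch, not kernel code: u1w0_faults,
wd_ignored and check_equivalence are made-up helper names, cr0_wp stands
in for is_write_protection(vcpu), and SMAP, fetch and PKRU itself are
deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Would an access fault on a user page (U=1) with W=0?  Reads never
 * fault; writes fault unless this is a supervisor access with CR0.WP=0.
 */
static bool u1w0_faults(bool wf, bool uf, bool cr0_wp)
{
	bool w = !uf && !cr0_wp;	/* effective write permission */

	return wf && !w;
}

/* The condition from the original patch under which WD is cleared. */
static bool wd_ignored(bool wf, bool uf, bool cr0_wp)
{
	return !wf || (!uf && !cr0_wp);
}

/* Walk all eight input combinations and compare the two forms. */
static void check_equivalence(void)
{
	for (int i = 0; i < 8; i++) {
		bool wf = i & 1, uf = i & 2, cr0_wp = i & 4;

		assert(wd_ignored(wf, uf, cr0_wp) ==
		       !u1w0_faults(wf, uf, cr0_wp));
	}
}
```

If check_equivalence() passes, the branch can be replaced by a lookup of
the "would U=1 W=0 fault" bit without changing behaviour.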

I think I even prefer if update_permission_bitmask sets up a separate
bitmask:

		mmu->fault_u1w0 |= (wf && !w) << byte;

and then this other bitmap can be tested in permission_fault:


-		if (!wf || (!uf && !is_write_protection(vcpu)))
-			pkru_bits &= ~(1 << PKRU_WRITE);
+		/*
+		 * fault_u1w0 ignores SMAP and PKRU, so use the
+		 * partially-computed PFEC that we were given.
+		 */
+		fault_uw = (mmu->fault_u1w0 >> (pfec >> 1)) & 1;
+		pkru_bits &= ~(1 << PKRU_WRITE) |
+			(fault_uw << PKRU_WRITE);
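
To make the second variant concrete, here is a self-contained sketch of
how such a bitmask could be built and consumed.  It is an illustration
under assumptions, not the patch itself: build_fault_u1w0 and
mask_pkru_bits are hypothetical names, cr0_wp stands in for
is_write_protection(vcpu), the PFERR_* values follow the x86 page-fault
error code layout, and PKRU_READ/PKRU_WRITE are assumed to be bit
indices 0 and 1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PFERR_WRITE_MASK (1u << 1)
#define PFERR_USER_MASK  (1u << 2)
#define PFERR_RSVD_MASK  (1u << 3)
#define PFERR_FETCH_MASK (1u << 4)

#define PKRU_READ  0	/* assumed bit index of AD */
#define PKRU_WRITE 1	/* assumed bit index of WD */

/*
 * One bit per byte index (pfec >> 1): set if a U=1 W=0 page would fault
 * on that access.  SMAP, fetch and PKRU are ignored on purpose.
 */
static uint16_t build_fault_u1w0(bool cr0_wp)
{
	uint16_t mask = 0;

	for (unsigned byte = 0; byte < 16; byte++) {
		unsigned pfec = byte << 1;
		bool wf = pfec & PFERR_WRITE_MASK;
		bool uf = pfec & PFERR_USER_MASK;
		bool w = !uf && !cr0_wp;	/* supervisor write, WP=0 */

		mask |= (uint16_t)((wf && !w) << byte);
	}
	return mask;
}

/* Clear WD from pkru_bits when the U=1 W=0 page would not fault. */
static unsigned mask_pkru_bits(uint16_t fault_u1w0, unsigned pfec,
			       unsigned pkru_bits)
{
	unsigned fault_uw = (fault_u1w0 >> (pfec >> 1)) & 1;

	return pkru_bits & (~(1u << PKRU_WRITE) | (fault_uw << PKRU_WRITE));
}
```

With CR0.WP=1 every byte index that has the write bit set is marked
(0xAAAA); with CR0.WP=0 only user writes are (0x8888).  A supervisor
write with CR0.WP=0 therefore turns pkru_bits 3 into 1, that is, WD is
dropped and AD is kept.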

These ideas are untested, of course.  I apologize for any mistakes.
However, they should apply both to Huaitong's current code (which needs
PFERR_PK_MASK in gva_to_gpa) and to my other suggestion from the reply
to patch 5 (http://article.gmane.org/gmane.comp.emulators.kvm.devel/148311).

Paolo
