From mboxrd@z Thu Jan 1 00:00:00 1970 From: Paolo Bonzini Subject: Re: [PATCH V4 3/7] KVM, pkeys: update memeory permission bitmask for pkeys Date: Tue, 8 Mar 2016 09:29:49 +0100 Message-ID: <56DE8D7D.5010302@redhat.com> References: <1457177252-7577-1-git-send-email-huaitong.han@intel.com> <1457177252-7577-4-git-send-email-huaitong.han@intel.com> <56DBDF57.4030607@linux.intel.com> <56DCB9BF.4020904@redhat.com> <56DE80B8.40900@linux.intel.com> Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 8bit Cc: kvm@vger.kernel.org To: Xiao Guangrong , Huaitong Han , gleb@kernel.org Return-path: Received: from mail-wm0-f66.google.com ([74.125.82.66]:32861 "EHLO mail-wm0-f66.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932157AbcCHI34 (ORCPT ); Tue, 8 Mar 2016 03:29:56 -0500 Received: by mail-wm0-f66.google.com with SMTP id n186so2683590wmn.0 for ; Tue, 08 Mar 2016 00:29:55 -0800 (PST) In-Reply-To: <56DE80B8.40900@linux.intel.com> Sender: kvm-owner@vger.kernel.org List-ID: On 08/03/2016 08:35, Xiao Guangrong wrote: >> well-predicted branches are _faster_ than branchless code. > > Er, i do not understand this. If these two case have the same cache hit, > how can a branch be faster? Because branchless code typically executes fewer instructions. Take the same example here: >> do { >> } while (level > PT_PAGE_TABLE_LEVEL && >> (!(gpte & PT_PAGE_SIZE_MASK) || >> level == mmu->root_level)); The assembly looks like (assuming %level, %gpte and %mmu are registers) cmp $1, %level jbe 1f test $128, %gpte jz beginning_of_loop cmpb ROOT_LEVEL_OFFSET(%mmu), %level je beginning_of_loop 1: These are two to six instructions, with no dependency and which the processor can change into one to three macro-ops. For the branchless code (I posted a patch to implement this algorithm yesterday): lea -2(%level), %temp1 orl %temp1, %gpte movzbl LAST_NONLEAF_LEVEL_OFFSET(%mmu), %temp1 movl %level, %temp2 subl %temp1, %temp2 andl %temp2, %gpte test $128, %gpte jz beginning_of_loop These are eight instructions, with some dependencies between them too. In some cases branchless code throws away the result of 10-15 instructions (because in the end it's ANDed with 0, for example). If it weren't for mispredictions, the branchy code would be faster. >> Here none of the branches is easily predicted, so we want to get rid of >> them. >> >> The next patch adds three branches, and they are not all equal: >> >> - is_long_vcpu is well predicted to true (or even for 32-bit OSes it >> should be well predicted if the host is not overcommitted). > > But, in the production, cpu over-commit is the normal case... It depends on the workload. I would guess that 32-bit OSes are more common where you have a single legacy guest because e.g. it doesn't have drivers for recent hardware. >>> However, i do not think we need a new byte index for PK. The conditions >>> detecting PK enablement >>> can be fully found in current vcpu content (i.e, CR4, EFER and U/S >>> access). >> >> Adding a new byte index lets you cache CR4.PKE (and actually EFER.LMA >> too, though Huaitong's patch doesn't do that). It's a good thing to do. >> U/S is also handled by adding a new byte index, see Huaitong's > > It is not on the same page, the U/S is the type of memory access which > is depended on vCPU runtime. Do you mean the type of page (ACC_USER_MASK)? Only U=1 pages are subject to PKRU, even in the kernel. The processor CPL (PFERR_USER_MASK) only matters if CR0.WP=0. > But the condition whether PKEY is enabled or not > is fully depended on the envorment of CPU and we should _always_ > check PKEY even if PFEC_PKEY is not set. > > As PKEY is not enabled on softmmu, the gva_to_gpa mostly comes from internal > KVM, that means we should always set PFEC.PKEY for all the gva_to_gpa request. > Wasting a bit is really unnecessary. > > And it is always better to move more workload from permission_fault() to > update_permission_bitmask() as the former is much hotter than the latter. I agree, but I'm not sure why you say that adding a bits adds more work to permission_fault(). Adding a bit lets us skip CR4.PKU and EFER.LMA checks in permission_fault() and in all gva_to_gpa() callers. So my proposal is to compute the "effective" PKRU bits (i.e. extract the relevant AD and WD bits, and mask away WD if irrelevant) in update_permission_bitmask(), and add PFERR_PK_MASK to the error code if they are nonzero. PFERR_PK_MASK must be computed in permission_fault(). It's a runtime condition that it's not known before. >> I don't like the idea of making permissions[] four times larger. > > Okay, then lets introduce a new field for PKEY separately. Your approach > , fault_u1w0, looks good to me. > >> I think I even prefer if update_permission_bitmask sets up a separate >> bitmask: >> >> mmu->fault_u1w0 |= (wf && !w) << byte; >> >> and then this other bitmap can be tested in permission_fault: >> >> - if (!wf || (!uf && !is_write_protection(vcpu))) >> - pkru_bits &= ~(1 << PKRU_WRITE); >> + /* >> + * fault_u1w0 ignores SMAP and PKRU, so use the >> + * partially-computed PFEC that we were given. >> + */ >> + fault_uw = (mmu->fault_u1w0 >> (pfec >> 1)) & 1; >> + pkru_bits &= ~(1 << PKRU_WRITE) | >> + (fault_uw << PKRU_WRITE); > > It looks good to me! Good. Thanks for reviewing the idea! Paolo