From mboxrd@z Thu Jan 1 00:00:00 1970
From: Xiao Guangrong
Subject: Re: [PATCH V4 3/7] KVM, pkeys: update memeory permission bitmask for pkeys
Date: Tue, 8 Mar 2016 17:19:26 +0800
Message-ID: <56DE991E.8070007@linux.intel.com>
References: <1457177252-7577-1-git-send-email-huaitong.han@intel.com>
 <1457177252-7577-4-git-send-email-huaitong.han@intel.com>
 <56DBDF57.4030607@linux.intel.com>
 <56DCB9BF.4020904@redhat.com>
 <56DE80B8.40900@linux.intel.com>
 <56DE8D7D.5010302@redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Cc: kvm@vger.kernel.org
To: Paolo Bonzini , Huaitong Han , gleb@kernel.org
Return-path:
Received: from mga02.intel.com ([134.134.136.20]:60182 "EHLO mga02.intel.com"
 rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932360AbcCHJTu
 (ORCPT ); Tue, 8 Mar 2016 04:19:50 -0500
In-Reply-To: <56DE8D7D.5010302@redhat.com>
Sender: kvm-owner@vger.kernel.org
List-ID:

On 03/08/2016 04:29 PM, Paolo Bonzini wrote:
>
>
> On 08/03/2016 08:35, Xiao Guangrong wrote:
>>> well-predicted branches are _faster_ than branchless code.
>>
>> Er, i do not understand this. If these two case have the same cache hit,
>> how can a branch be faster?
>
> Because branchless code typically executes more instructions.
>
> Take the same example here:
>
>>> do {
>>> } while (level > PT_PAGE_TABLE_LEVEL &&
>>>          (!(gpte & PT_PAGE_SIZE_MASK) ||
>>>           level == mmu->root_level));
>
> The assembly looks like (assuming %level, %gpte and %mmu are registers)
>
>      cmp $1, %level
>      jbe 1f
>      test $128, %gpte
>      jz beginning_of_loop
>      cmpb ROOT_LEVEL_OFFSET(%mmu), %level
>      je beginning_of_loop
> 1:
>
> These are two to six instructions, with no dependency and which the
> processor can change into one to three macro-ops.
> For the branchless code (I posted a patch to implement this algorithm
> yesterday):
>
>      lea -2(%level), %temp1
>      orl %temp1, %gpte
>      movzbl LAST_NONLEAF_LEVEL_OFFSET(%mmu), %temp1
>      movl %level, %temp2
>      subl %temp1, %temp2
>      andl %temp2, %gpte
>      test $128, %gpte
>      jz beginning_of_loop
>
> These are eight instructions, with some dependencies between them too.
> In some cases branchless code throws away the result of 10-15
> instructions (because in the end it's ANDed with 0, for example).  If it
> weren't for mispredictions, the branchy code would be faster.
>

Good lesson, thank you, Paolo. :)

>>> Here none of the branches is easily predicted, so we want to get rid of
>>> them.
>>>
>>> The next patch adds three branches, and they are not all equal:
>>>
>>> - is_long_vcpu is well predicted to true (or even for 32-bit OSes it
>>> should be well predicted if the host is not overcommitted).
>>
>> But, in the production, cpu over-commit is the normal case...
>
> It depends on the workload.  I would guess that 32-bit OSes are more
> common where you have a single legacy guest because e.g. it doesn't have
> drivers for recent hardware.
>
>>>> However, i do not think we need a new byte index for PK. The conditions
>>>> detecting PK enablement
>>>> can be fully found in current vcpu content (i.e, CR4, EFER and U/S
>>>> access).
>>>
>>> Adding a new byte index lets you cache CR4.PKE (and actually EFER.LMA
>>> too, though Huaitong's patch doesn't do that).  It's a good thing to do.
>>> U/S is also handled by adding a new byte index, see Huaitong's
>>
>> It is not on the same page, the U/S is the type of memory access which
>> is depended on vCPU runtime.
>
> Do you mean the type of page (ACC_USER_MASK)?  Only U=1 pages are
> subject to PKRU, even in the kernel.  The processor CPL
> (PFERR_USER_MASK) only matters if CR0.WP=0.

No.
The index is:
 | Byte index: page fault error code [4:1]
So the type I mentioned is the type of memory access issued by the CPU, e.g.
the CPU is writing to the memory or executing from it.

>
>> But the condition whether PKEY is enabled or not
>> is fully depended on the environment of CPU and we should _always_
>> check PKEY even if PFEC.PKEY is not set.
>>
>> As PKEY is not enabled on softmmu, the gva_to_gpa mostly comes from internal
>> KVM, that means we should always set PFEC.PKEY for all the gva_to_gpa request.
>> Wasting a bit is really unnecessary.
>>
>> And it is always better to move more workload from permission_fault() to
>> update_permission_bitmask() as the former is much hotter than the latter.
>
> I agree, but I'm not sure why you say that adding a bit adds more work
> to permission_fault().

A branch to check PFEC.PKEY, which is not well predicted on soft mmu. (It
should always be set with EPT, since the page table walk is done in software;
if we only consider EPT, we can assume it is always true.)

>
> Adding a bit lets us skip CR4.PKE and EFER.LMA checks in
> permission_fault() and in all gva_to_gpa() callers.

The point is when we can clear this bit to skip these checks. We should
_always_ check PKEY even if PFEC.PKEY = 0, because:
1) All gva_to_gpa()s issued by KVM should always check PKEY. This is the
   EPT-only case.

2) If the feature is enabled in softmmu, the shadow page table may change its
   behavior; for example, an mmio access causes a reserved PF, which may clear
   PFEC.PKEY.

And skipping these checks is not really necessary, as we can take them into
account when we update the bitmask.

>
> So my proposal is to compute the "effective" PKRU bits (i.e. extract the
> relevant AD and WD bits, and mask away WD if irrelevant) in
> update_permission_bitmask(), and add PFERR_PK_MASK to the error code if
> they are nonzero.
>
> PFERR_PK_MASK must be computed in permission_fault().  It's a runtime
> condition that it's not known before.
>

Yes, you are right.
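
For reference, here is a minimal C sketch of the two loop-exit tests compared in the quoted assembly. It is only an illustration, not the code from the posted patch. The function names, the count_mismatches() helper, and the assumption that last_nonleaf_level equals root_level (with root_level between 2 and 5) are made up for this sketch. The values of PT_PAGE_SIZE_MASK and PT_PAGE_TABLE_LEVEL are read off the `test $128` and `cmp $1` instructions.

```c
#include <stdbool.h>

/* Values inferred from the quoted assembly (test $128, cmp $1). */
#define PT_PAGE_SIZE_MASK   0x80u
#define PT_PAGE_TABLE_LEVEL 1u

/* Branchy form: the walk stops when the quoted do/while condition is false. */
static bool is_last_gpte_branchy(unsigned root_level, unsigned level,
				 unsigned gpte)
{
	if (level <= PT_PAGE_TABLE_LEVEL)
		return true;
	if (!(gpte & PT_PAGE_SIZE_MASK))
		return false;
	return level != root_level;
}

/* Branchless form, mirroring the lea/or/sub/and/test sequence. */
static bool is_last_gpte_branchless(unsigned last_nonleaf_level,
				    unsigned level, unsigned gpte)
{
	/*
	 * For level == 1, level - 2 wraps to all-ones and forces bit 7 on.
	 * For levels 2..5 it is at most 3 and leaves bit 7 untouched.
	 */
	gpte |= level - PT_PAGE_TABLE_LEVEL - 1;

	/*
	 * level - last_nonleaf_level wraps (bit 7 set) when level is below
	 * last_nonleaf_level, and is 0 (bit 7 clear) when they are equal.
	 */
	gpte &= level - last_nonleaf_level;

	return (gpte & PT_PAGE_SIZE_MASK) != 0;
}

/*
 * Hypothetical helper: counts disagreements between the two forms,
 * assuming last_nonleaf_level == root_level and root_level in 2..5.
 */
static int count_mismatches(void)
{
	static const unsigned gptes[] = {
		0x00u, 0x80u, 0x07u, 0x87u, 0xffffff7fu, 0xffffffffu
	};
	int mismatches = 0;

	for (unsigned root = 2; root <= 5; root++)
		for (unsigned level = 1; level <= root; level++)
			for (unsigned i = 0; i < sizeof(gptes) / sizeof(gptes[0]); i++)
				if (is_last_gpte_branchy(root, level, gptes[i]) !=
				    is_last_gpte_branchless(root, level, gptes[i]))
					mismatches++;
	return mismatches;
}
```

Under those assumptions the two forms agree for every level from 1 to root_level, so the choice between them comes down to instruction count versus misprediction cost, as discussed above.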