From mboxrd@z Thu Jan 1 00:00:00 1970
From: Xiao Guangrong
Subject: Re: KVM: MMU: Tracking guest writes through EPT entries ?
Date: Sun, 02 Sep 2012 21:29:38 +0800
Message-ID: <50435F42.7030308@linux.vnet.ibm.com>
References: <501747A1.6000105@linux.vnet.ibm.com> <503F3EE0.6080502@linux.vnet.ibm.com> <5040277F.5080503@linux.vnet.ibm.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Cc: kvm@vger.kernel.org
To: "Hui Lin (Hugo)"
Sender: kvm-owner@vger.kernel.org

On 09/01/2012 05:30 AM, Hui Lin (Hugo) wrote:
> On Thu, Aug 30, 2012 at 9:54 PM, Xiao Guangrong wrote:
>> On 08/31/2012 02:59 AM, Hugo wrote:
>>> On Thu, Aug 30, 2012 at 5:22 AM, Xiao Guangrong wrote:
>>>> On 08/28/2012 11:30 AM, Felix wrote:
>>>>> Xiao Guangrong linux.vnet.ibm.com> writes:
>>>>>
>>>>>>
>>>>>> On 07/31/2012 01:18 AM, Sunil wrote:
>>>>>>> Hello List,
>>>>>>>
>>>>>>> I am a KVM newbie and studying KVM mmu code.
>>>>>>>
>>>>>>> On the existing guest, I am trying to track all guest writes by
>>>>>>> marking page table entry as read-only in EPT entry [ I am using Intel
>>>>>>> machine with vmx and ept support ].
>>>>>>> Looks like EPT support re-uses
>>>>>>> shadow page table(SPT) code and hence some of SPT routines.
>>>>>>>
>>>>>>> I was thinking of below possible approach. Use pte_list_walk() to
>>>>>>> traverse through list of sptes and use mmu_spte_update() to flip the
>>>>>>> PT_WRITABLE_MASK flag. But all SPTEs are not part of any single list;
>>>>>>> but on separate lists (based on gfn, page level, memory_slot). So,
>>>>>>> recording all the faulted guest GFN and then using above method work ?
>>>>>>>
>>>>>>
>>>>>> There are two ways to write-protect all sptes:
>>>>>> - use kvm_mmu_slot_remove_write_access() on all memslots
>>>>>> - walk the shadow page cache to get the shadow pages in the highest level
>>>>>>   (level = 4 on EPT), then write-protect its entries.
>>>>>>
>>>>>> If you just want to do it for the specified gfn, you can use
>>>>>> rmap_write_protect().
>>>>>>
>>>>>> Just inquisitive, what is your purpose? :)
>>>>>>
>>>>> Hi, Guangrong,
>>>>>
>>>>> I have done similar things like Sunil did. Simply for study purpose. However, I
>>>>> found some very weird situations. Basically, in the guest vm, I allocate a chunk
>>>>> of memory (with size of a page) in a user level program. Through a guest kernel
>>>>> level module and my self defined hypercall, I pass the gva of this memory to
>>>>> kvm. Then I try different methods in the hypercall handler to write protect this
>>>>> page of memory. You can see that I want to write protect it through ETP instead
>>>>> of write protected in the guest page tables.
>>>>>
>>>>> 1. I use kvm_mmu_gva_to_gpa_read to translate the gva into gpa.
>>>>> Based on the
>>>>> function, kvm_mmu_get_spte_hierarchy(vcpu, gpa, spte[4]), I change the codes to
>>>>> read sptep (the pointer to spte) instead of spte, so I can modify the spte
>>>>> corresponding to this gpa. What I observe is that if I modify spte[0] (I think
>>>>> this is the lowest level page table entry corresponding to EPT table; I can
>>>>> successfully modify it as the changes are reflected in the result of calling
>>>>> kvm_mmu_get_spte_hierarchy again), but my user level program in vm can still
>>>>> write to this page.
>>>>>
>>>>> In your this blog post, you mentioned (the shadow pages in the highest level
>>>>> (level = 4 on EPT)), I don't understand this part. Does this mean I have to
>>>>> modify spte[3] instead of spte[0]? I just try modify spte[1] and spte[3], both
>>>>> can cause vmexit. So I am totally confused about the meaning of level used in
>>>>> shadow page table and its relations to shadow page table. Can you help me to
>>>>> understand this?
>>>>>
>>>>> 2. As suggested by this post, I also use rmap_write_protect() to write protect
>>>>> this page. With kvm_mmu_get_spte_hierarchy(vcpu, gpa, spte[4]), I still can see
>>>>> that spte[0] gives me xxxxxx005 such result, this means that the function is
>>>>> called successfully. But still I can write to this page.
>>>>>
>>>>> I even try the function kvm_age_hva() to remove this spte, this gives me 0 of
>>>>> spte[0], but I still can write to this page. So I am further confused about the
>>>>> level used in the shadow page?
>>>>>
>>>>
>>>> kvm_mmu_get_spte_hierarchy get sptes out of mmu-lock, you can hold spin_lock(&vcpu->kvm->mmu_lock)
>>>> and use for_each_shadow_entry instead. And, after change, did you flush all tlbs?
>>>
>>> I do apply the lock in my codes and I do flush tlb.
>>>
>>>>
>>>> If it can not work, please post your code.
>>>>
>>>
>>> Here is my codes. The modifications are made in x86/x86.c in
>>>
>>> KVM_HC_HL_EPTPER is my hypercall number.
>>>
>>> Method 1:
>>>
>>> int kvm_emulate_hypercall(struct kvm_vcpu *vcpu){
>>> ................
>>>
>>>     case KVM_HC_HL_EPTPER :
>>>         //// This method is not working
>>>
>>>         localGpa = kvm_mmu_gva_to_gpa_read(vcpu, a0, &localEx);
>>>         if(localGpa == UNMAPPED_GVA){
>>>             printk("read is not correct\n");
>>>             return -KVM_ENOSYS;
>>>         }
>>>
>>>         hl_kvm_mmu_update_spte(vcpu, localGpa, 5);
>>>         hl_result = kvm_mmu_get_spte_hierarchy(vcpu, localGpa,
>>> hl_sptes);
>>>
>>>         printk("after changes return result is %d , gpa: %llx
>>> sptes: %llx , %llx , %llx , %llx \n", hl_result, localGpa,
>>> hl_sptes[0], hl_sptes[1], hl_sptes[2], hl_sptes[3]);
>>>         kvm_flush_remote_tlbs(vcpu->kvm);
>>> ...................
>>> }
>>>
>>> The function hl_kvm_mmu_update_spte is defined as
>>>
>>> int hl_kvm_mmu_update_spte(struct kvm_vcpu *vcpu, u64 addr, u64 mask)
>>> {
>>>     struct kvm_shadow_walk_iterator iterator;
>>>     int nr_sptes = 0;
>>>     u64 sptes[4];
>>>     u64* sptep[4];
>>>     u64 localMask = 0xFFFFFFFFFFFFFFF8; /// 1000
>>>
>>>     spin_lock(&vcpu->kvm->mmu_lock);
>>>     for_each_shadow_entry(vcpu, addr, iterator) {
>>>         sptes[iterator.level-1] = *iterator.sptep;
>>>         sptep[iterator.level-1] = iterator.sptep;
>>>         nr_sptes++;
>>>         if (!is_shadow_present_pte(*iterator.sptep))
>>>             break;
>>>     }
>>>
>>>     sptes[0] = sptes[0] & localMask;
>>>     sptes[0] = sptes[0] | mask ;
>>>     __set_spte(sptep[0], sptes[0]);
>>>     //update_spte(sptep[0], sptes[0]);
>>>     /*
>>>     sptes[1] = sptes[1] & localMask;
>>>     sptes[1] = sptes[1] | mask ;
>>>     update_spte(sptep[1], sptes[1]);
>>>     */
>>>     /*
>>>
>>>     sptes[3] = sptes[3] & localMask;
>>>     sptes[3] = sptes[3] | mask ;
>>>     update_spte(sptep[3], sptes[3]);
>>>     */
>>>     spin_unlock(&vcpu->kvm->mmu_lock);
>>>
>>>     return nr_sptes;
>>> }
>>>
>>> The execution results are from kern.log
>>>
>>> xxxx kernel: [ 4371.002579] hypercall f002, a71000
>>> xxxx kernel: [ 4371.002581] after changes return result is 4 , gpa:
>>> 723ae000 sptes: 16c7bd275 , 1304c7007 , 136d6f007 , 13cc88007
>>>
>>> I find that if
>>> I write to this page, actually the write protected
>>> permission bit is set as writable again. I am not quite sure why.
>>>
>>> Method 2:
>>>
>>> int kvm_emulate_hypercall(struct kvm_vcpu *vcpu){
>>> ................
>>>
>>>     case KVM_HC_HL_EPTPER :
>>>         //// This method is not working
>>>         localGpa = kvm_mmu_gva_to_gpa_read(vcpu, a0, &localEx);
>>>         localGfn = gpa_to_gfn(localGpa);
>>>
>>>         spin_lock(&vcpu->kvm->mmu_lock);
>>>         hl_result = rmap_write_protect(vcpu->kvm, localGfn);
>>>         printk("local gfn is %llx , result of kvm_age_hva is
>>> %d\n", localGfn, hl_result);
>>>         kvm_flush_remote_tlbs(vcpu->kvm);
>>>         spin_unlock(&vcpu->kvm->mmu_lock);
>>>
>>>         hl_result = kvm_mmu_get_spte_hierarchy(vcpu, localGpa,
>>> hl_sptes);
>>>         printk("return result is %d , gpa: %llx sptes: %llx ,
>>> %llx , %llx , %llx \n", hl_result, localGpa, hl_sptes[0], hl_sptes[1],
>>> hl_sptes[2], hl_sptes[3]);
>>> ...................
>>> }
>>>
>>> The execution results are:
>>>
>>> xxxx kernel: [ 4044.020816] hypercall f002, 1201000
>>> xxxx kernel: [ 4044.020819] local gfn is 70280 , result of kvm_age_hva is 1
>>> xxxx kernel: [ 4044.020823] return result is 4 , gpa: 70280000 sptes:
>>> 13c2aa275 , 1304ff007 , 15eb3d007 , 15eb3e007
>>>
>>> My feeling is seems that I have to modify something else instead of spte alone.
>>
>> Aha.
>>
>> There two issues i found:
>>
>> - you should use kvm_mmu_gva_to_gpa_write instead of kvm_mmu_gva_to_gpa_read, since
>>   if the page in guest is readonly, it will trigger COW and switch to a new page
>>
>> - you also need to do some work on page fault path to avoid setting W bit on the spte
>>
>
> Thanks for the quick reply.
>
> BTW, I am using KVM 2.6.32.27 kernel module. And use virt-manager as
> the guest module. The host is Ubuntu 10.04 with kernel 2.6.32.33.
>
> I have changed to use kvm_mmu_gva_to_gpa_write function.
>
> I am also putting extra printk message into page_fault,
> tdp_page_fault, and inject_page_fault, functions, none of them gives

Could you show these changes, please?

> me any information if I write to the memory whose spte is changed as
> readonly. I also try to trace when the __set_spte is called after I

Try adding some debug messages in mmu_spte_set and mmu_spte_update.

> modify the spte. I still don't get any luck. So I really want to know
> where the problem is. As Davidlohr mentions, this is a basic technique
> that I found in many papers, that is why I used it as a study case.

You'd better show what you did in the guest OS.

>
> There is another experiment that I am doing. It is said in the
> comments of the code that : Page fault handler will be triggered by
> "normal guest page fault due to the guest pte marked not present, not
> writable, or not executable" (FNAME(page_fault) function in the
> paging_tmpl.h). I have use mprotect system call in my user program in
> the guest OS to set the guest page as readonly, and write to this
> page. In Linux kernel, this is handle by the seg fault. Actually
> page_fault is not called in the kvm. I don't get it, why kvm wants to

Check the output of cat /sys/module/kvm_intel/parameters/ept: if it is 'Y',
this is normal. If it is 'N', what you see is beyond me. :)

> interfere with the guest page fault and force it to vm exit. I believe
> there is a performance issue in theory.

If ept/npt is used, kvm does not care about #PF in the guest; FNAME(page_fault)
is only used when ept/npt is unsupported.