From mboxrd@z Thu Jan  1 00:00:00 1970
From: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Subject: Re: KVM: MMU: Tracking guest writes through EPT entries ?
Date: Sun, 02 Sep 2012 21:29:38 +0800
Message-ID: <50435F42.7030308@linux.vnet.ibm.com>
References: <CALSEb2cFMwAXu_jSuY6GQn6R9C4BbTrMqtbNPQmdrCrPU-+swA@mail.gmail.com> <501747A1.6000105@linux.vnet.ibm.com> <loom.20120828T050655-538@post.gmane.org> <503F3EE0.6080502@linux.vnet.ibm.com> <CAKq214m14svOVnmv5gJGowuNEcvPOP0aAHKL93x0GyG4=fsd2w@mail.gmail.com> <5040277F.5080503@linux.vnet.ibm.com> <CAKq214=sMvNP2wdrbfcAXv+WYC=WftCJKPsSvkRtt3cnrfjVXw@mail.gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Cc: kvm@vger.kernel.org
To: "Hui Lin (Hugo)" <hlin33@illinois.edu>
Return-path: <kvm-owner@vger.kernel.org>
Received: from e23smtp01.au.ibm.com ([202.81.31.143]:55847 "EHLO
	e23smtp01.au.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1752063Ab2IBN3q (ORCPT <rfc822;kvm@vger.kernel.org>);
	Sun, 2 Sep 2012 09:29:46 -0400
Received: from /spool/local
	by e23smtp01.au.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only! Violators will be prosecuted
	for <kvm@vger.kernel.org> from <xiaoguangrong@linux.vnet.ibm.com>;
	Sun, 2 Sep 2012 23:28:33 +1000
Received: from d23av03.au.ibm.com (d23av03.au.ibm.com [9.190.234.97])
	by d23relay04.au.ibm.com (8.13.8/8.13.8/NCO v10.0) with ESMTP id q82DKjWp30343246
	for <kvm@vger.kernel.org>; Sun, 2 Sep 2012 23:20:45 +1000
Received: from d23av03.au.ibm.com (loopback [127.0.0.1])
	by d23av03.au.ibm.com (8.14.4/8.13.1/NCO v10.0 AVout) with ESMTP id q82DTeU0031391
	for <kvm@vger.kernel.org>; Sun, 2 Sep 2012 23:29:40 +1000
In-Reply-To: <CAKq214=sMvNP2wdrbfcAXv+WYC=WftCJKPsSvkRtt3cnrfjVXw@mail.gmail.com>
Sender: kvm-owner@vger.kernel.org
List-ID: <kvm.vger.kernel.org>

On 09/01/2012 05:30 AM, Hui Lin (Hugo) wrote:
> On Thu, Aug 30, 2012 at 9:54 PM, Xiao Guangrong
> <xiaoguangrong@linux.vnet.ibm.com> wrote:
>> On 08/31/2012 02:59 AM, Hugo wrote:
>>> On Thu, Aug 30, 2012 at 5:22 AM, Xiao Guangrong
>>> <xiaoguangrong@linux.vnet.ibm.com> wrote:
>>>> On 08/28/2012 11:30 AM, Felix wrote:
>>>>> Xiao Guangrong <xiaoguangrong <at> linux.vnet.ibm.com> writes:
>>>>>
>>>>>>
>>>>>> On 07/31/2012 01:18 AM, Sunil wrote:
>>>>>>> Hello List,
>>>>>>>
>>>>>>> I am a KVM newbie and studying KVM mmu code.
>>>>>>>
>>>>>>> On the existing guest, I am trying to track all guest writes by
>>>>>>> marking page table entry as read-only in EPT entry [ I am using Intel
>>>>>>> machine with vmx and ept support ]. Looks like EPT support re-uses
>>>>>>> shadow page table(SPT) code and hence some of SPT routines.
>>>>>>>
>>>>>>> I was thinking of below possible approach. Use pte_list_walk() to
>>>>>>> traverse through list of sptes and use mmu_spte_update()  to flip the
>>>>>>> PT_WRITABLE_MASK flag. But all SPTEs are not part of any single list;
>>>>>>> but on separate lists (based on gfn, page level, memory_slot). So,
>>>>>>> recording all the faulted guest GFN and then using above method work ?
>>>>>>>
>>>>>>
>>>>>> There are two ways to write-protect all sptes:
>>>>>> - use kvm_mmu_slot_remove_write_access() on all memslots
>>>>>> - walk the shadow page cache to get the shadow pages in the highest level
>>>>>>   (level = 4 on EPT), then write-protect its entries.
>>>>>>
>>>>>> If you just want to do it for the specified gfn, you can use
>>>>>> rmap_write_protect().
>>>>>>
>>>>>> Just inquisitive, what is your purpose? :)
>>>>>>
>>>>>>
>>>>> Hi, Guangrong,
>>>>>
>>>>> I have done something similar to what Sunil did, simply for study purposes. However, I
>>>>> found some very weird situations. Basically, in the guest vm, I allocate a chunk
>>>>> of memory (one page in size) in a user-level program. Through a guest kernel
>>>>> module and my own hypercall, I pass the gva of this memory to
>>>>> kvm. Then I try different methods in the hypercall handler to write-protect this
>>>>> page of memory. As you can see, I want to write-protect it through EPT instead
>>>>> of write-protecting it in the guest page tables.
>>>>>
>>>>> 1. I use kvm_mmu_gva_to_gpa_read to translate the gva into a gpa. Starting from the
>>>>> function kvm_mmu_get_spte_hierarchy(vcpu, gpa, spte[4]), I changed the code to
>>>>> read sptep (the pointer to the spte) instead of the spte, so I can modify the spte
>>>>> corresponding to this gpa. What I observe is that I can modify spte[0] (I think
>>>>> this is the lowest-level page table entry in the EPT table), and the
>>>>> change is reflected in the result of calling
>>>>> kvm_mmu_get_spte_hierarchy again, but my user-level program in the vm can still
>>>>> write to this page.
>>>>>
>>>>> In your post, you mentioned the shadow pages in the highest level
>>>>> (level = 4 on EPT). I don't understand this part. Does this mean I have to
>>>>> modify spte[3] instead of spte[0]? I tried modifying spte[1] and spte[3]; both
>>>>> cause a vmexit. So I am totally confused about the meaning of the level used in
>>>>> the shadow page code and how it relates to the shadow page table. Can you help me
>>>>> understand this?
>>>>>
>>>>> 2. As suggested in this post, I also used rmap_write_protect() to write-protect
>>>>> this page. With kvm_mmu_get_spte_hierarchy(vcpu, gpa, spte[4]), I can see
>>>>> that spte[0] gives a result such as xxxxxx005, which means that the function was
>>>>> called successfully. But I can still write to this page.
>>>>>
>>>>> I even tried the function kvm_age_hva() to remove this spte. This gives me 0 for
>>>>> spte[0], but I can still write to this page. So I am further confused about the
>>>>> level used in the shadow page.
>>>>>
>>>>
>>>> kvm_mmu_get_spte_hierarchy reads the sptes outside of mmu-lock; you can hold spin_lock(&vcpu->kvm->mmu_lock)
>>>> and use for_each_shadow_entry instead. And, after the change, did you flush all TLBs?
>>>
>>> I do take the lock in my code, and I do flush the TLB.
>>>
>>>>
>>>> If it can not work, please post your code.
>>>>
>>>
>>> Here is my code. The modifications are made in x86/x86.c in
>>>
>>> KVM_HC_HL_EPTPER is my hypercall number.
>>>
>>> Method 1:
>>>
>>> int kvm_emulate_hypercall(struct kvm_vcpu *vcpu){
>>>                    ................
>>>
>>> case KVM_HC_HL_EPTPER :
>>>                 //// This method is not working
>>>
>>>                 localGpa = kvm_mmu_gva_to_gpa_read(vcpu, a0, &localEx);
>>>                 if(localGpa == UNMAPPED_GVA){
>>>                         printk("read is not correct\n");
>>>                         return -KVM_ENOSYS;
>>>                 }
>>>
>>>                 hl_kvm_mmu_update_spte(vcpu, localGpa, 5);
>>>                 hl_result = kvm_mmu_get_spte_hierarchy(vcpu, localGpa,
>>> hl_sptes);
>>>
>>>                 printk("after changes return result is %d , gpa: %llx
>>> sptes: %llx , %llx , %llx , %llx \n", hl_result, localGpa,
>>> hl_sptes[0], hl_sptes[1], hl_sptes[2], hl_sptes[3]);
>>>                 kvm_flush_remote_tlbs(vcpu->kvm);
>>>                  ...................
>>> }
>>>
>>> The function hl_kvm_mmu_update_spte is defined as
>>>
>>> int hl_kvm_mmu_update_spte(struct kvm_vcpu *vcpu, u64 addr, u64 mask)
>>> {
>>>         struct kvm_shadow_walk_iterator iterator;
>>>         int nr_sptes = 0;
>>>         u64 sptes[4];
>>>         u64* sptep[4];
>>>         u64 localMask = 0xFFFFFFFFFFFFFFF8;   /// 1000
>>>
>>>         spin_lock(&vcpu->kvm->mmu_lock);
>>>         for_each_shadow_entry(vcpu, addr, iterator) {
>>>                 sptes[iterator.level-1] = *iterator.sptep;
>>>                 sptep[iterator.level-1] = iterator.sptep;
>>>                 nr_sptes++;
>>>                 if (!is_shadow_present_pte(*iterator.sptep))
>>>                         break;
>>>         }
>>>
>>>         sptes[0] = sptes[0] & localMask;
>>>         sptes[0] = sptes[0] | mask ;
>>>         __set_spte(sptep[0], sptes[0]);
>>>         //update_spte(sptep[0], sptes[0]);
>>> /*
>>>         sptes[1] = sptes[1] & localMask;
>>>         sptes[1] = sptes[1] | mask ;
>>>         update_spte(sptep[1], sptes[1]);
>>> */
>>> /*
>>>
>>>         sptes[3] = sptes[3] & localMask;
>>>         sptes[3] = sptes[3] | mask ;
>>>         update_spte(sptep[3], sptes[3]);
>>> */
>>>         spin_unlock(&vcpu->kvm->mmu_lock);
>>>
>>>         return nr_sptes;
>>> }
>>>
>>> The execution results are from kern.log
>>>
>>> xxxx kernel: [ 4371.002579] hypercall f002, a71000
>>> xxxx kernel: [ 4371.002581] after changes return result is 4 , gpa:
>>> 723ae000 sptes: 16c7bd275 , 1304c7007 , 136d6f007 , 13cc88007
>>>
>>> I find that if I write to this page, the write-protected
>>> permission bit is actually set back to writable. I am not quite sure why.
>>>
>>> Method 2:
>>>
>>> int kvm_emulate_hypercall(struct kvm_vcpu *vcpu){
>>>                    ................
>>>
>>> case KVM_HC_HL_EPTPER :
>>>                 //// This method is not working
>>>                 localGpa = kvm_mmu_gva_to_gpa_read(vcpu, a0, &localEx);
>>>                 localGfn = gpa_to_gfn(localGpa);
>>>
>>>                 spin_lock(&vcpu->kvm->mmu_lock);
>>>                 hl_result = rmap_write_protect(vcpu->kvm, localGfn);
>>>                 printk("local gfn is %llx , result of kvm_age_hva is
>>> %d\n", localGfn, hl_result);
>>>                 kvm_flush_remote_tlbs(vcpu->kvm);
>>>                 spin_unlock(&vcpu->kvm->mmu_lock);
>>>
>>>                 hl_result = kvm_mmu_get_spte_hierarchy(vcpu, localGpa,
>>> hl_sptes);
>>>                 printk("return result is %d , gpa: %llx sptes: %llx ,
>>> %llx , %llx , %llx \n", hl_result, localGpa, hl_sptes[0], hl_sptes[1],
>>> hl_sptes[2], hl_sptes[3]);
>>>                  ...................
>>> }
>>>
>>> The execution results are:
>>>
>>> xxxx kernel: [ 4044.020816] hypercall f002, 1201000
>>> xxxx kernel: [ 4044.020819] local gfn is 70280 , result of kvm_age_hva is 1
>>> xxxx kernel: [ 4044.020823] return result is 4 , gpa: 70280000 sptes:
>>> 13c2aa275 , 1304ff007 , 15eb3d007 , 15eb3e007
>>>
>>> My feeling is that I have to modify something else rather than the spte alone.
>>
>> Aha.
>>
>> There are two issues I found:
>>
>> - you should use kvm_mmu_gva_to_gpa_write instead of kvm_mmu_gva_to_gpa_read, since
>>   if the page is read-only in the guest, the write will trigger COW and switch to a new page
>>
>> - you also need to do some work in the page fault path to avoid setting the W bit on the spte
>>
> 
> Thanks for the quick reply.
> 
> BTW, I am using KVM 2.6.32.27 kernel module. And use virt-manager as
> the guest module. The host is Ubuntu 10.04 with kernel 2.6.32.33.
> 
> I have changed to use kvm_mmu_gva_to_gpa_write function.
> 
> I have also put extra printk messages into the page_fault,
> tdp_page_fault, and inject_page_fault functions; none of them gives

Could you show these changes, please?

> me any information when I write to the memory whose spte was changed to
> read-only. I also tried to trace when __set_spte is called after I

Try adding some debug messages in mmu_spte_set and mmu_spte_update.

> modify the spte, but I still had no luck. So I really want to know
> where the problem is. As Davidlohr mentions, this is a basic technique
> that I found in many papers, that is why I used it as a study case.

You'd better show what you did in the guest OS.

> 
> There is another experiment that I am doing. The comments in the
> code say that the page fault handler is triggered by a
> "normal guest page fault due to the guest pte marked not present, not
> writable, or not executable" (the FNAME(page_fault) function in
> paging_tmpl.h). I used the mprotect system call in my user program in
> the guest OS to set the guest page read-only, and then wrote to this
> page. In the Linux kernel, this is handled as a segfault. However,
> page_fault is not actually called in kvm. I don't get why kvm wants to

Run cat /sys/module/kvm_intel/parameters/ept. If it prints 'Y', this is normal.
If it prints 'N', what you see is beyond my understanding. :)
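
The same check can be done from a small test program. This is only a
minimal sketch in C, assuming the parameter file holds a single 'Y' or 'N'
character; the helper name kvm_param_enabled is hypothetical and not part
of any kvm interface:

```c
#include <stdio.h>

/*
 * Hypothetical helper: read a boolean module parameter file such as
 * /sys/module/kvm_intel/parameters/ept.
 * Returns 1 for 'Y', 0 for 'N', and -1 if the file cannot be opened
 * or starts with anything else.
 */
int kvm_param_enabled(const char *path)
{
	FILE *f = fopen(path, "r");
	int c;

	if (f == NULL)
		return -1;

	c = fgetc(f);	/* only the first character matters */
	fclose(f);

	if (c == 'Y')
		return 1;
	if (c == 'N')
		return 0;
	return -1;
}
```

Calling kvm_param_enabled("/sys/module/kvm_intel/parameters/ept") on the
host should then return 1 when EPT is enabled, and -1 if the kvm_intel
module is not loaded.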

> interfere with the guest page fault and force it to vm exit. I believe
> there is a performance issue in theory.

If EPT/NPT is used, KVM does not care about #PF in the guest; FNAME(page_fault)
is used only when EPT/NPT is unsupported.
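
For reading the spte values in your logs, here is a minimal user-space
sketch in C. It assumes the Intel EPT entry layout, where bit 0 is read,
bit 1 is write and bit 2 is execute. The macro and function names are
hypothetical, made up for illustration; they are not the kvm ones:

```c
#include <stdbool.h>
#include <stdint.h>

/* Low three permission bits of an EPT entry (assumed Intel layout). */
#define EPT_READ	(1ULL << 0)
#define EPT_WRITE	(1ULL << 1)
#define EPT_EXEC	(1ULL << 2)
#define EPT_PERM_MASK	(EPT_READ | EPT_WRITE | EPT_EXEC)

/* Extract the R/W/X bits from a raw spte value. */
static inline uint64_t ept_perms(uint64_t spte)
{
	return spte & EPT_PERM_MASK;
}

/* True if the W bit is set, i.e. guest writes do not fault on this entry. */
static inline bool ept_writable(uint64_t spte)
{
	return (spte & EPT_WRITE) != 0;
}

/*
 * Replace the R/W/X bits and keep everything else, which is the same
 * clear-then-OR that the 0xFFFFFFFFFFFFFFF8 mask does in your code.
 */
static inline uint64_t ept_set_perms(uint64_t spte, uint64_t perms)
{
	return (spte & ~EPT_PERM_MASK) | (perms & EPT_PERM_MASK);
}
```

With these, ept_perms(0x16c7bd275) gives 5 (read and execute, no write),
while the upper-level entry 0x1304c7007 has all three bits set. So the
leaf entry in your log is already read-only at the time you print it.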