From mboxrd@z Thu Jan  1 00:00:00 1970
From: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Subject: Re: KVM: MMU: Tracking guest writes through EPT entries ?
Date: Mon, 03 Sep 2012 14:11:19 +0800
Message-ID: <50444A07.8060102@linux.vnet.ibm.com>
References: <CALSEb2cFMwAXu_jSuY6GQn6R9C4BbTrMqtbNPQmdrCrPU-+swA@mail.gmail.com> <501747A1.6000105@linux.vnet.ibm.com> <loom.20120828T050655-538@post.gmane.org> <503F3EE0.6080502@linux.vnet.ibm.com> <CAKq214m14svOVnmv5gJGowuNEcvPOP0aAHKL93x0GyG4=fsd2w@mail.gmail.com> <5040277F.5080503@linux.vnet.ibm.com> <CAKq214=sMvNP2wdrbfcAXv+WYC=WftCJKPsSvkRtt3cnrfjVXw@mail.gmail.com> <bb786815f6c14144acc31b8041486282@CITESHT1.ad.uillinois.edu> <CAKq214myDLvZXNqJhOorCfDX5YmXMiBqTV5-LryOC7p++wiGPQ@mail.gmail.com>
Mime-Version: 1.0
Content-Type: multipart/mixed;
 boundary="------------040801070001040606070505"
Cc: "kvm@vger.kernel.org" <kvm@vger.kernel.org>
To: Hugo <hugolin615@gmail.com>
Return-path: <kvm-owner@vger.kernel.org>
Received: from e23smtp06.au.ibm.com ([202.81.31.148]:48920 "EHLO
	e23smtp06.au.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1755030Ab2ICGLZ (ORCPT <rfc822;kvm@vger.kernel.org>);
	Mon, 3 Sep 2012 02:11:25 -0400
Received: from /spool/local
	by e23smtp06.au.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only! Violators will be prosecuted
	for <kvm@vger.kernel.org> from <xiaoguangrong@linux.vnet.ibm.com>;
	Mon, 3 Sep 2012 16:10:22 +1000
Received: from d23av02.au.ibm.com (d23av02.au.ibm.com [9.190.235.138])
	by d23relay05.au.ibm.com (8.13.8/8.13.8/NCO v10.0) with ESMTP id q8362FOL23462140
	for <kvm@vger.kernel.org>; Mon, 3 Sep 2012 16:02:15 +1000
Received: from d23av02.au.ibm.com (loopback [127.0.0.1])
	by d23av02.au.ibm.com (8.14.4/8.13.1/NCO v10.0 AVout) with ESMTP id q836BK5h024755
	for <kvm@vger.kernel.org>; Mon, 3 Sep 2012 16:11:20 +1000
In-Reply-To: <CAKq214myDLvZXNqJhOorCfDX5YmXMiBqTV5-LryOC7p++wiGPQ@mail.gmail.com>
Sender: kvm-owner@vger.kernel.org
List-ID: <kvm.vger.kernel.org>

This is a multi-part message in MIME format.
--------------040801070001040606070505
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit

On 09/03/2012 10:09 AM, Hugo wrote:
> On Sun, Sep 2, 2012 at 8:29 AM, Xiao Guangrong
> <xiaoguangrong@linux.vnet.ibm.com> wrote:
>> On 09/01/2012 05:30 AM, Hui Lin (Hugo) wrote:
>>> On Thu, Aug 30, 2012 at 9:54 PM, Xiao Guangrong
>>> <xiaoguangrong@linux.vnet.ibm.com> wrote:
>>>> On 08/31/2012 02:59 AM, Hugo wrote:
>>>>> On Thu, Aug 30, 2012 at 5:22 AM, Xiao Guangrong
>>>>> <xiaoguangrong@linux.vnet.ibm.com> wrote:
>>>>>> On 08/28/2012 11:30 AM, Felix wrote:
>>>>>>> Xiao Guangrong <xiaoguangrong <at> linux.vnet.ibm.com> writes:
>>>>>>>
>>>>>>>>
>>>>>>>> On 07/31/2012 01:18 AM, Sunil wrote:
>>>>>>>>> Hello List,
>>>>>>>>>
>>>>>>>>> I am a KVM newbie and studying KVM mmu code.
>>>>>>>>>
>>>>>>>>> On the existing guest, I am trying to track all guest writes by
>>>>>>>>> marking page table entry as read-only in EPT entry [ I am using Intel
>>>>>>>>> machine with vmx and ept support ]. Looks like EPT support re-uses
>>>>>>>>> shadow page table(SPT) code and hence some of SPT routines.
>>>>>>>>>
>>>>>>>>> I was thinking of below possible approach. Use pte_list_walk() to
>>>>>>>>> traverse through list of sptes and use mmu_spte_update()  to flip the
>>>>>>>>> PT_WRITABLE_MASK flag. But all SPTEs are not part of any single list;
>>>>>>>>> but on separate lists (based on gfn, page level, memory_slot). So,
>>>>>>>>> recording all the faulted guest GFN and then using above method work ?
>>>>>>>>>
>>>>>>>>
>>>>>>>> There are two ways to write-protect all sptes:
>>>>>>>> - use kvm_mmu_slot_remove_write_access() on all memslots
>>>>>>>> - walk the shadow page cache to get the shadow pages in the highest level
>>>>>>>>   (level = 4 on EPT), then write-protect its entries.
>>>>>>>>
>>>>>>>> If you just want to do it for the specified gfn, you can use
>>>>>>>> rmap_write_protect().
>>>>>>>>
>>>>>>>> Just inquisitive, what is your purpose? :)
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>> Hi, Guangrong,
>>>>>>>
>>>>>>> I have done similar things like Sunil did. Simply for study purpose. However, I
>>>>>>> found some very weird situations. Basically, in the guest vm, I allocate a chunk
>>>>>>> of memory (with size of a page) in a user level program. Through a guest kernel
>>>>>>> level module and my self defined hypercall, I pass the gva of this memory to
>>>>>>> kvm. Then I try different methods in the hypercall handler to write protect this
>>>>>>> page of memory. You can see that I want to write protect it through EPT instead
>>>>>>> of write-protecting it in the guest page tables.
>>>>>>>
>>>>>>> 1. I use kvm_mmu_gva_to_gpa_read to translate the gva into gpa. Based on the
>>>>>>> function, kvm_mmu_get_spte_hierarchy(vcpu, gpa, spte[4]), I change the codes to
>>>>>>> read sptep (the pointer to spte) instead of spte, so I can modify the spte
>>>>>>> corresponding to this gpa. What I observe is that if I modify spte[0] (I think
>>>>>>> this is the lowest level page table entry corresponding to EPT table; I can
>>>>>>> successfully modify it as the changes are reflected in the result of calling
>>>>>>> kvm_mmu_get_spte_hierarchy again), but my user level program in vm can still
>>>>>>> write to this page.
>>>>>>>
>>>>>>> In your this blog post, you mentioned (the shadow pages in the highest level
>>>>>>> (level = 4 on EPT)), I don't understand this part. Does this mean I have to
>>>>>>> modify spte[3] instead of spte[0]? I just try modify spte[1] and spte[3], both
>>>>>>> can cause vmexit. So I am totally confused about the meaning of level used in
>>>>>>> shadow page table and its relations to shadow page table. Can you help me to
>>>>>>> understand this?
>>>>>>>
>>>>>>> 2. As suggested by this post, I also use rmap_write_protect() to write protect
>>>>>>> this page. With kvm_mmu_get_spte_hierarchy(vcpu, gpa, spte[4]), I still can see
>>>>>>> that spte[0] gives me xxxxxx005 such result, this means that the function is
>>>>>>> called successfully. But still I can write to this page.
>>>>>>>
>>>>>>> I even try the function kvm_age_hva() to remove this spte, this gives me 0 of
>>>>>>> spte[0], but I still can write to this page. So I am further confused about the
>>>>>>> level used in the shadow page?
>>>>>>>
>>>>>>
>>>>>> kvm_mmu_get_spte_hierarchy reads the sptes outside of mmu-lock; you can hold spin_lock(&vcpu->kvm->mmu_lock)
>>>>>> and use for_each_shadow_entry instead. And, after the change, did you flush all TLBs?
>>>>>
>>>>> I do apply the lock in my codes and I do flush tlb.
>>>>>
>>>>>>
>>>>>> If it can not work, please post your code.
>>>>>>
>>>>>
>>>>> Here is my codes. The modifications are made in x86/x86.c in
>>>>>
>>>>> KVM_HC_HL_EPTPER is my hypercall number.
>>>>>
>>>>> Method 1:
>>>>>
>>>>> int kvm_emulate_hypercall(struct kvm_vcpu *vcpu){
>>>>>                    ................
>>>>>
>>>>> case KVM_HC_HL_EPTPER :
>>>>>                 //// This method is not working
>>>>>
>>>>>                 localGpa = kvm_mmu_gva_to_gpa_read(vcpu, a0, &localEx);
>>>>>                 if(localGpa == UNMAPPED_GVA){
>>>>>                         printk("read is not correct\n");
>>>>>                         return -KVM_ENOSYS;
>>>>>                 }
>>>>>
>>>>>                 hl_kvm_mmu_update_spte(vcpu, localGpa, 5);
>>>>>                 hl_result = kvm_mmu_get_spte_hierarchy(vcpu, localGpa,
>>>>> hl_sptes);
>>>>>
>>>>>                 printk("after changes return result is %d , gpa: %llx
>>>>> sptes: %llx , %llx , %llx , %llx \n", hl_result, localGpa,
>>>>> hl_sptes[0], hl_sptes[1], hl_sptes[2], hl_sptes[3]);
>>>>>                 kvm_flush_remote_tlbs(vcpu->kvm);
>>>>>                  ...................
>>>>> }
>>>>>
>>>>> The function hl_kvm_mmu_update_spte is defined as
>>>>>
>>>>> int hl_kvm_mmu_update_spte(struct kvm_vcpu *vcpu, u64 addr, u64 mask)
>>>>> {
>>>>>         struct kvm_shadow_walk_iterator iterator;
>>>>>         int nr_sptes = 0;
>>>>>         u64 sptes[4];
>>>>>         u64* sptep[4];
>>>>>         u64 localMask = 0xFFFFFFFFFFFFFFF8;   /// 1000
>>>>>
>>>>>         spin_lock(&vcpu->kvm->mmu_lock);
>>>>>         for_each_shadow_entry(vcpu, addr, iterator) {
>>>>>                 sptes[iterator.level-1] = *iterator.sptep;
>>>>>                 sptep[iterator.level-1] = iterator.sptep;
>>>>>                 nr_sptes++;
>>>>>                 if (!is_shadow_present_pte(*iterator.sptep))
>>>>>                         break;
>>>>>         }
>>>>>
>>>>>         sptes[0] = sptes[0] & localMask;
>>>>>         sptes[0] = sptes[0] | mask ;
>>>>>         __set_spte(sptep[0], sptes[0]);
>>>>>         //update_spte(sptep[0], sptes[0]);
>>>>> /*
>>>>>         sptes[1] = sptes[1] & localMask;
>>>>>         sptes[1] = sptes[1] | mask ;
>>>>>         update_spte(sptep[1], sptes[1]);
>>>>> */
>>>>> /*
>>>>>
>>>>>         sptes[3] = sptes[3] & localMask;
>>>>>         sptes[3] = sptes[3] | mask ;
>>>>>         update_spte(sptep[3], sptes[3]);
>>>>> */
>>>>>         spin_unlock(&vcpu->kvm->mmu_lock);
>>>>>
>>>>>         return nr_sptes;
>>>>> }
>>>>>
>>>>> The execution results are from kern.log
>>>>>
>>>>> xxxx kernel: [ 4371.002579] hypercall f002, a71000
>>>>> xxxx kernel: [ 4371.002581] after changes return result is 4 , gpa:
>>>>> 723ae000 sptes: 16c7bd275 , 1304c7007 , 136d6f007 , 13cc88007
>>>>>
>>>>> I find that if I write to this page, actually the write protected
>>>>> permission bit is set as writable again. I am not quite sure why.
>>>>>
>>>>> Method 2:
>>>>>
>>>>> int kvm_emulate_hypercall(struct kvm_vcpu *vcpu){
>>>>>                    ................
>>>>>
>>>>> case KVM_HC_HL_EPTPER :
>>>>>                 //// This method is not working
>>>>>                 localGpa = kvm_mmu_gva_to_gpa_read(vcpu, a0, &localEx);
>>>>>                 localGfn = gpa_to_gfn(localGpa);
>>>>>
>>>>>                 spin_lock(&vcpu->kvm->mmu_lock);
>>>>>                 hl_result = rmap_write_protect(vcpu->kvm, localGfn);
>>>>>                 printk("local gfn is %llx , result of kvm_age_hva is
>>>>> %d\n", localGfn, hl_result);
>>>>>                 kvm_flush_remote_tlbs(vcpu->kvm);
>>>>>                 spin_unlock(&vcpu->kvm->mmu_lock);
>>>>>
>>>>>                 hl_result = kvm_mmu_get_spte_hierarchy(vcpu, localGpa,
>>>>> hl_sptes);
>>>>>                 printk("return result is %d , gpa: %llx sptes: %llx ,
>>>>> %llx , %llx , %llx \n", hl_result, localGpa, hl_sptes[0], hl_sptes[1],
>>>>> hl_sptes[2], hl_sptes[3]);
>>>>>                  ...................
>>>>> }
>>>>>
>>>>> The execution results are:
>>>>>
>>>>> xxxx kernel: [ 4044.020816] hypercall f002, 1201000
>>>>> xxxx kernel: [ 4044.020819] local gfn is 70280 , result of kvm_age_hva is 1
>>>>> xxxx kernel: [ 4044.020823] return result is 4 , gpa: 70280000 sptes:
>>>>> 13c2aa275 , 1304ff007 , 15eb3d007 , 15eb3e007
>>>>>
>>>>> My feeling is seems that I have to modify something else instead of spte alone.
>>>>
>>>> Aha.
>>>>
>>>> There are two issues I found:
>>>>
>>>> - you should use kvm_mmu_gva_to_gpa_write instead of kvm_mmu_gva_to_gpa_read, since
>>>>   if the page is read-only in the guest, a write will trigger COW and switch to a new page
>>>>
>>>> - you also need to do some work on the page fault path to avoid setting the W bit on the spte
>>>>
>>>
>>> Thanks for the quick reply.
>>>
>>> BTW, I am using KVM 2.6.32.27 kernel module. And use virt-manager as
>>> the guest module. The host is Ubuntu 10.04 with kernel 2.6.32.33.
>>>
>>> I have changed to use kvm_mmu_gva_to_gpa_write function.
>>>
>>> I am also putting extra printk message into page_fault,
>>> tdp_page_fault, and inject_page_fault, functions, none of them gives
>>
>> Could you show these change please?
> 
> What I did in tdp_page_fault and inject_page_fault is simple,
> 
> In tdp_page_fault, inject_page_fault, I added the same piece of codes
> at the beginning of the function. The target_gpa is set in x86/x86.c
> by the vmcall handler:
> 
>         /////
>         if(gpa == target_gpa){
>                 printk("XXXX Debug %llx \n", gpa);
>         }
>         /////
> 
> This way, no crazy kernel logs are made.
>>
>>> me any information if I write to the memory whose spte is changed as
>>> readonly. I also try to trace when the __set_spte is called after I
>>
>> Try to add some debug message in mmu_spte_set and mmu_spte_update
>>
>>> modify the spte. I still don't get any luck. So I really want to know
>>> where the problem is. As Davidlohr mentions, this is a basic technique
>>> that I found in many papers, that is why I used it as a study case.
>>
>> You'd better show what you did in the guest OS.
> What I did in Guest OS includes two parts:
> kernel level: pseudo device driver, includes read and write function.
> The write function accept the virtual address defined in a user
> program. And then pass this virtual address to the KVM through vmcall.
> This is basic device driver module introduced in linux device driver.
> Guest level:
> I allocate a page of memory in the program's address space:
>         pagesize = sysconf(_SC_PAGE_SIZE);
>         if(pagesize == -1){
>                 printf("sysconf error\n");
>                 return -1;
>         }
>         //buffer = (char*)memalign(pagesize, pagesize);
>         ori = (char*)malloc(1024 + pagesize - 1);
>         if (ori == NULL){
>                 printf("memalign\n");
>                 return -1;
>         }
>         buffer = (char *)(((unsigned long) ori + pagesize -1) & ~(pagesize-1));
>         address = (unsigned long) buffer;
> Then pass the "address " to the kernel module:
> 
> size = write(fd, &address, sizeof(unsigned long));

Okay, I have written a test case; it works fine on my box. If it also works
on your machine, you can easily find out what is wrong in your code; if not,
please let me know.

The code is attached.




--------------040801070001040606070505
Content-Type: text/plain; charset=UTF-8;
 name="diff.test"
Content-Transfer-Encoding: base64
Content-Disposition: attachment;
 filename="diff.test"

ZGlmZiAtLWdpdCBhL2FyY2gveDg2L2t2bS9tbXUuYyBiL2FyY2gveDg2L2t2bS9tbXUuYwpp
bmRleCBjOWI2ZTMzLi4zZTMyZjliIDEwMDY0NAotLS0gYS9hcmNoL3g4Ni9rdm0vbW11LmMK
KysrIGIvYXJjaC94ODYva3ZtL21tdS5jCkBAIC0xMjAwLDcgKzEyMDAsNyBAQCB2b2lkIGt2
bV9tbXVfd3JpdGVfcHJvdGVjdF9wdF9tYXNrZWQoc3RydWN0IGt2bSAqa3ZtLAogCX0KIH0K
IAotc3RhdGljIGJvb2wgcm1hcF93cml0ZV9wcm90ZWN0KHN0cnVjdCBrdm0gKmt2bSwgdTY0
IGdmbikKK2Jvb2wgcm1hcF93cml0ZV9wcm90ZWN0KHN0cnVjdCBrdm0gKmt2bSwgdTY0IGdm
bikKIHsKIAlzdHJ1Y3Qga3ZtX21lbW9yeV9zbG90ICpzbG90OwogCXVuc2lnbmVkIGxvbmcg
KnJtYXBwOwpAQCAtMzI5Niw2ICszMjk2LDggQEAgc3RhdGljIGJvb2wgdHJ5X2FzeW5jX3Bm
KHN0cnVjdCBrdm1fdmNwdSAqdmNwdSwgYm9vbCBwcmVmYXVsdCwgZ2ZuX3QgZ2ZuLAogCXJl
dHVybiBmYWxzZTsKIH0KIAorZ2ZuX3QgZmlsdGVyX2dmbjsKKwogc3RhdGljIGludCB0ZHBf
cGFnZV9mYXVsdChzdHJ1Y3Qga3ZtX3ZjcHUgKnZjcHUsIGd2YV90IGdwYSwgdTMyIGVycm9y
X2NvZGUsCiAJCQkgIGJvb2wgcHJlZmF1bHQpCiB7CkBAIC0zMzExLDYgKzMzMTMsMTEgQEAg
c3RhdGljIGludCB0ZHBfcGFnZV9mYXVsdChzdHJ1Y3Qga3ZtX3ZjcHUgKnZjcHUsIGd2YV90
IGdwYSwgdTMyIGVycm9yX2NvZGUsCiAJQVNTRVJUKHZjcHUpOwogCUFTU0VSVChWQUxJRF9Q
QUdFKHZjcHUtPmFyY2gubW11LnJvb3RfaHBhKSk7CiAKKwlpZiAoZmlsdGVyX2dmbiAmJiAo
ZmlsdGVyX2dmbiA9PSBncGFfdG9fZ2ZuKGdwYSkpKSB7CisJCXByaW50aygiQ2F0Y2ggZ2Zu
ICVsbHguXG4iLCBmaWx0ZXJfZ2ZuKTsKKwkJcmV0dXJuIDE7CisJfQorCiAJaWYgKHVubGlr
ZWx5KGVycm9yX2NvZGUgJiBQRkVSUl9SU1ZEX01BU0spKQogCQlyZXR1cm4gaGFuZGxlX21t
aW9fcGFnZV9mYXVsdCh2Y3B1LCBncGEsIGVycm9yX2NvZGUsIHRydWUpOwogCmRpZmYgLS1n
aXQgYS9hcmNoL3g4Ni9rdm0veDg2LmMgYi9hcmNoL3g4Ni9rdm0veDg2LmMKaW5kZXggZDQ0
ZWRhYS4uZDNlMjY2YyAxMDA2NDQKLS0tIGEvYXJjaC94ODYva3ZtL3g4Ni5jCisrKyBiL2Fy
Y2gveDg2L2t2bS94ODYuYwpAQCAtMTc1OSw2ICsxNzU5LDI0IEBAIGludCBrdm1fc2V0X21z
cl9jb21tb24oc3RydWN0IGt2bV92Y3B1ICp2Y3B1LCB1MzIgbXNyLCB1NjQgZGF0YSkKIAkJ
CXJldHVybiAxOwogCQl2Y3B1LT5hcmNoLm9zdncuc3RhdHVzID0gZGF0YTsKIAkJYnJlYWs7
CisJY2FzZSAweDk5OTk5OTk5OiB7CisJCWV4dGVybiBib29sIHJtYXBfd3JpdGVfcHJvdGVj
dChzdHJ1Y3Qga3ZtICprdm0sIHU2NCBnZm4pOworCQlleHRlcm4gZ2ZuX3QgZmlsdGVyX2dm
bjsKKwkKKwkJZ3BhX3QgZ3BhID0gIGt2bV9tbXVfZ3ZhX3RvX2dwYV93cml0ZSh2Y3B1LCBk
YXRhLCBOVUxMKTsKKwkJaWYgKGdwYSA9PSBVTk1BUFBFRF9HVkEpIHsKKwkJCXByaW50aygi
dW5tYXBwZWQgZ3ZhOiVsbHguXG4iLCBkYXRhKTsKKwkJfQorCisJCXByaW50aygiR1ZBICVs
bHggLT4gR1BBOiVsbHguXG4iLCBkYXRhLCBncGEpOworCQlmaWx0ZXJfZ2ZuID0gZ3BhX3Rv
X2dmbihncGEpOworCQlzcGluX2xvY2soJnZjcHUtPmt2bS0+bW11X2xvY2spOworCQlpZiAo
cm1hcF93cml0ZV9wcm90ZWN0KHZjcHUtPmt2bSwgZmlsdGVyX2dmbikpCisJCQlrdm1fZmx1
c2hfcmVtb3RlX3RsYnModmNwdS0+a3ZtKTsKKwkJc3Bpbl91bmxvY2soJnZjcHUtPmt2bS0+
bW11X2xvY2spOworCX0KKwlicmVhazsKKwogCWRlZmF1bHQ6CiAJCWlmIChtc3IgJiYgKG1z
ciA9PSB2Y3B1LT5rdm0tPmFyY2gueGVuX2h2bV9jb25maWcubXNyKSkKIAkJCXJldHVybiB4
ZW5faHZtX2NvbmZpZyh2Y3B1LCBkYXRhKTsK
--------------040801070001040606070505
Content-Type: text/x-csrc;
 name="main.c"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;
 filename="main.c"

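#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/init.h>
#include <linux/gfp.h>
#include <linux/string.h>
#include <asm/msr.h>

/* Guest-side test: hand one kernel page to the patched host, then write to it. */
static int __init test_init(void)
{
	unsigned long addr = __get_free_page(GFP_KERNEL);

	if (!addr)
		return -ENOMEM;

	printk("va:%lx.\n", addr);
	/* 0x99999999 is the test-only MSR handled by the host patch (diff.test). */
	wrmsrl(0x99999999, addr);

	/* This write is expected to fault into the host's tdp_page_fault(). */
	strcpy((char *)addr, "KVMKVM");
	printk("addr %s.\n", (char *)addr);

	free_page(addr);
	return 0;
}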

static void __exit test_exit(void)
{
}

MODULE_LICENSE("GPL");

module_init(test_init);
module_exit(test_exit);

--------------040801070001040606070505--