* [RFC] Expose infrastructure for unpinning guest memory
@ 2007-10-11 21:32 Anthony Liguori
[not found] ` <1192138344500-git-send-email-aliguori-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
0 siblings, 1 reply; 9+ messages in thread
From: Anthony Liguori @ 2007-10-11 21:32 UTC (permalink / raw)
To: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f; +Cc: Anthony Liguori, Avi Kivity
Now that we have userspace memory allocation, I wanted to play with ballooning.
The idea is that when a guest "balloons" down, we simply unpin the underlying
physical memory and the host kernel may or may not swap it. To reclaim
ballooned memory, the guest can just start using it and we'll pin it on demand.
The following patch is a stab at providing the right infrastructure for pinning
and automatic repinning. I don't have a lot of comfort in the MMU code so I
thought I'd get some feedback before going much further.
gpa_to_hpa is a little awkward to hook, but it seems like the right place in the
code. I'm most uncertain about the SMP safety of the unpinning. Presumably,
I have to hold the kvm lock around the mmu_unshadow and page_cache release to
ensure that another VCPU doesn't fault the page back in after mmu_unshadow?
Feedback would be greatly appreciated!
diff --git a/drivers/kvm/kvm.h b/drivers/kvm/kvm.h
index 4a52d6e..8abe770 100644
--- a/drivers/kvm/kvm.h
+++ b/drivers/kvm/kvm.h
@@ -409,6 +409,7 @@ struct kvm_memory_slot {
unsigned long *rmap;
unsigned long *dirty_bitmap;
int user_alloc; /* user allocated memory */
+ unsigned long userspace_addr;
};
struct kvm {
@@ -652,6 +653,7 @@ int kvm_mmu_unprotect_page_virt(struct kvm_vcpu *vcpu, gva_t gva);
void __kvm_mmu_free_some_pages(struct kvm_vcpu *vcpu);
int kvm_mmu_load(struct kvm_vcpu *vcpu);
void kvm_mmu_unload(struct kvm_vcpu *vcpu);
+int kvm_mmu_unpin(struct kvm *kvm, gfn_t gfn);
int kvm_emulate_hypercall(struct kvm_vcpu *vcpu);
diff --git a/drivers/kvm/kvm_main.c b/drivers/kvm/kvm_main.c
index a0f8366..74105d1 100644
--- a/drivers/kvm/kvm_main.c
+++ b/drivers/kvm/kvm_main.c
@@ -774,6 +774,7 @@ static int kvm_vm_ioctl_set_memory_region(struct kvm *kvm,
unsigned long pages_num;
new.user_alloc = 1;
+ new.userspace_addr = mem->userspace_addr;
down_read(&current->mm->mmap_sem);
pages_num = get_user_pages(current, current->mm,
@@ -1049,12 +1050,36 @@ struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn)
struct page *gfn_to_page(struct kvm *kvm, gfn_t gfn)
{
struct kvm_memory_slot *slot;
+ struct page *page;
+ uint64_t slot_index;
gfn = unalias_gfn(kvm, gfn);
slot = __gfn_to_memslot(kvm, gfn);
if (!slot)
return NULL;
- return slot->phys_mem[gfn - slot->base_gfn];
+
+ slot_index = gfn - slot->base_gfn;
+ page = slot->phys_mem[slot_index];
+ if (unlikely(page == NULL)) {
+ unsigned long pages_num;
+
+ down_read(&current->mm->mmap_sem);
+
+ pages_num = get_user_pages(current, current->mm,
+ slot->userspace_addr +
+ (slot_index << PAGE_SHIFT),
+ 1, 1, 0, &slot->phys_mem[slot_index],
+ NULL);
+
+ up_read(&current->mm->mmap_sem);
+
+ if (pages_num != 1)
+ page = NULL;
+ else
+ page = slot->phys_mem[slot_index];
+ }
+
+ return page;
}
EXPORT_SYMBOL_GPL(gfn_to_page);
diff --git a/drivers/kvm/mmu.c b/drivers/kvm/mmu.c
index f52604a..1820816 100644
--- a/drivers/kvm/mmu.c
+++ b/drivers/kvm/mmu.c
@@ -25,6 +25,7 @@
#include <linux/mm.h>
#include <linux/highmem.h>
#include <linux/module.h>
+#include <linux/pagemap.h>
#include <asm/page.h>
#include <asm/cmpxchg.h>
@@ -820,6 +821,33 @@ static void mmu_unshadow(struct kvm *kvm, gfn_t gfn)
}
}
+int kvm_mmu_unpin(struct kvm *kvm, gfn_t gfn)
+{
+ struct kvm_memory_slot *slot;
+ struct page *page;
+
+ /* FIXME for each active vcpu */
+
+ gfn = unalias_gfn(kvm, gfn);
+ slot = gfn_to_memslot(kvm, gfn);
+ if (!slot)
+ return -EINVAL;
+
+ /* FIXME: do we need to hold a lock here? */
+
+ /* Remove page from shadow MMU and unpin page */
+ mmu_unshadow(kvm, gfn);
+ page = slot->phys_mem[gfn - slot->base_gfn];
+ if (page) {
+ if (!PageReserved(page))
+ SetPageDirty(page);
+ page_cache_release(page);
+ slot->phys_mem[gfn - slot->base_gfn] = NULL;
+ }
+
+ return 0;
+}
+
static void page_header_update_slot(struct kvm *kvm, void *pte, gpa_t gpa)
{
int slot = memslot_id(kvm, gfn_to_memslot(kvm, gpa >> PAGE_SHIFT));
-------------------------------------------------------------------------
This SF.net email is sponsored by: Splunk Inc.
Still grepping through log files to find problems? Stop.
Now Search log events and configuration files using AJAX and a browser.
Download your FREE copy of Splunk now >> http://get.splunk.com/
^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [RFC] Expose infrastructure for unpinning guest memory
[not found] ` <1192138344500-git-send-email-aliguori-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2007-10-11 21:59 ` Izik Eidus
[not found] ` <470E9CB6.4030107-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
2007-10-12 6:32 ` Avi Kivity
1 sibling, 1 reply; 9+ messages in thread
From: Izik Eidus @ 2007-10-11 21:59 UTC (permalink / raw)
To: Anthony Liguori; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, Avi Kivity
Anthony Liguori wrote:
> Now that we have userspace memory allocation, I wanted to play with ballooning.
> The idea is that when a guest "balloons" down, we simply unpin the underlying
> physical memory and the host kernel may or may not swap it. To reclaim
> ballooned memory, the guest can just start using it and we'll pin it on demand.
>
> The following patch is a stab at providing the right infrastructure for pinning
> and automatic repinning. I don't have a lot of comfort in the MMU code so I
> thought I'd get some feedback before going much further.
>
> gpa_to_hpa is a little awkward to hook, but it seems like the right place in the
> code. I'm most uncertain about the SMP safety of the unpinning. Presumably,
> I have to hold the kvm lock around the mmu_unshadow and page_cache release to
> ensure that another VCPU doesn't fault the page back in after mmu_unshadow?
>
> Feedback would be greatly appreciated!
>
> diff --git a/drivers/kvm/kvm.h b/drivers/kvm/kvm.h
> index 4a52d6e..8abe770 100644
> --- a/drivers/kvm/kvm.h
> +++ b/drivers/kvm/kvm.h
> @@ -409,6 +409,7 @@ struct kvm_memory_slot {
> unsigned long *rmap;
> unsigned long *dirty_bitmap;
> int user_alloc; /* user allocated memory */
> + unsigned long userspace_addr;
> };
>
> struct kvm {
> @@ -652,6 +653,7 @@ int kvm_mmu_unprotect_page_virt(struct kvm_vcpu *vcpu, gva_t gva);
> void __kvm_mmu_free_some_pages(struct kvm_vcpu *vcpu);
> int kvm_mmu_load(struct kvm_vcpu *vcpu);
> void kvm_mmu_unload(struct kvm_vcpu *vcpu);
> +int kvm_mmu_unpin(struct kvm *kvm, gfn_t gfn);
>
> int kvm_emulate_hypercall(struct kvm_vcpu *vcpu);
>
> diff --git a/drivers/kvm/kvm_main.c b/drivers/kvm/kvm_main.c
> index a0f8366..74105d1 100644
> --- a/drivers/kvm/kvm_main.c
> +++ b/drivers/kvm/kvm_main.c
> @@ -774,6 +774,7 @@ static int kvm_vm_ioctl_set_memory_region(struct kvm *kvm,
> unsigned long pages_num;
>
> new.user_alloc = 1;
> + new.userspace_addr = mem->userspace_addr;
> down_read(&current->mm->mmap_sem);
>
> pages_num = get_user_pages(current, current->mm,
> @@ -1049,12 +1050,36 @@ struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn)
> struct page *gfn_to_page(struct kvm *kvm, gfn_t gfn)
> {
> struct kvm_memory_slot *slot;
> + struct page *page;
> + uint64_t slot_index;
>
> gfn = unalias_gfn(kvm, gfn);
> slot = __gfn_to_memslot(kvm, gfn);
> if (!slot)
> return NULL;
> - return slot->phys_mem[gfn - slot->base_gfn];
> +
> + slot_index = gfn - slot->base_gfn;
> + page = slot->phys_mem[slot_index];
> + if (unlikely(page == NULL)) {
> + unsigned long pages_num;
> +
> + down_read(&current->mm->mmap_sem);
> +
> + pages_num = get_user_pages(current, current->mm,
> + slot->userspace_addr +
> + (slot_index << PAGE_SHIFT),
> + 1, 1, 0, &slot->phys_mem[slot_index],
> + NULL);
> +
> + up_read(&current->mm->mmap_sem);
> +
> + if (pages_num != 1)
> + page = NULL;
> + else
> + page = slot->phys_mem[slot_index];
> + }
> +
> + return page;
> }
> EXPORT_SYMBOL_GPL(gfn_to_page);
>
> diff --git a/drivers/kvm/mmu.c b/drivers/kvm/mmu.c
> index f52604a..1820816 100644
> --- a/drivers/kvm/mmu.c
> +++ b/drivers/kvm/mmu.c
> @@ -25,6 +25,7 @@
> #include <linux/mm.h>
> #include <linux/highmem.h>
> #include <linux/module.h>
> +#include <linux/pagemap.h>
>
> #include <asm/page.h>
> #include <asm/cmpxchg.h>
> @@ -820,6 +821,33 @@ static void mmu_unshadow(struct kvm *kvm, gfn_t gfn)
> }
> }
>
> +int kvm_mmu_unpin(struct kvm *kvm, gfn_t gfn)
> +{
> + struct kvm_memory_slot *slot;
> + struct page *page;
> +
> + /* FIXME for each active vcpu */
> +
> + gfn = unalias_gfn(kvm, gfn);
> + slot = gfn_to_memslot(kvm, gfn);
> + if (!slot)
> + return -EINVAL;
> +
> + /* FIXME: do we need to hold a lock here? */
> +
> + /* Remove page from shadow MMU and unpin page */
> + mmu_unshadow(kvm, gfn);
> + page = slot->phys_mem[gfn - slot->base_gfn];
> + if (page) {
> + if (!PageReserved(page))
> + SetPageDirty(page);
> + page_cache_release(page);
> + slot->phys_mem[gfn - slot->base_gfn] = NULL;
> + }
> +
> + return 0;
> +}
> +
> static void page_header_update_slot(struct kvm *kvm, void *pte, gpa_t gpa)
> {
> int slot = memslot_id(kvm, gfn_to_memslot(kvm, gpa >> PAGE_SHIFT));
>
> -------------------------------------------------------------------------
>
kvm_memory_slot
heh, I am working on a similar patch, and our gfn_to_page and the change
to kvm_memory_slot match even down to the variable names :)
A few things you have to do to make this work:
First, make gfn_to_page an always-safe function (return bad_page in case
of failure; I have a patch for this if you want), and change
kvm_read_guest_page / kvm_write_guest_page / kvm_clear_guest_page to do
put_page after using the page.
Second, hack the rmap to do reverse mapping for every present pte
and put_page() the pages at rmap_remove().
And that is about all it takes to make this work.
* Re: [RFC] Expose infrastructure for unpinning guest memory
[not found] ` <470E9CB6.4030107-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
@ 2007-10-11 22:18 ` Anthony Liguori
[not found] ` <470EA136.8020100-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
0 siblings, 1 reply; 9+ messages in thread
From: Anthony Liguori @ 2007-10-11 22:18 UTC (permalink / raw)
To: Izik Eidus; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, Avi Kivity
Izik Eidus wrote:
>> static void page_header_update_slot(struct kvm *kvm, void *pte,
>> gpa_t gpa)
>> {
>> int slot = memslot_id(kvm, gfn_to_memslot(kvm, gpa >> PAGE_SHIFT));
>>
>> -------------------------------------------------------------------------
>>
>>
> kvm_memory_slot
>
> heh, I am working on a similar patch, and our gfn_to_page and the change
> to kvm_memory_slot match even down to the variable names :)
Ah, fantastic :-) Care to share what you currently have?
> few things you have to do to make this work:
> make gfn_to_page an always-safe function (return bad_page in case of
> failure; I have a patch for this if you want)
That seems pretty obvious. No reason not to have that committed now.
> hacking the kvm_read_guest_page / kvm_write_guest_page
> kvm_clear_guest_page to do put_page after using the page
The idea being that kvm_read_guest_page() will effectively pin the page
and put_page() has the effect of unpinning it? It seems to me that we
should be using page_cache_release() since we're not just
get_page()'ing the memory. I may be wrong though.
Both of these are optimizations, though.  They're not strictly needed for
what I'm after since, in the case of ballooning, there's no reason why
someone would be calling kvm_read_guest_page() on the ballooned memory.
>
> second, is hacking the rmap to do reverse mapping for every present
> pte and put_page() the pages at rmap_remove()
> and that is about all, to make this work.
If I understand you correctly, this is to unpin the page whenever it is
removed from the rmap? That would certainly be useful but it's still an
optimization. The other obvious optimization to me would be to not use
get_user_pages() on all memory to start with and instead, allow pages to
be faulted in on use. This is particularly useful for creating a VM
with a very large amount of memory, and immediately ballooning down.
That way the large amount of memory doesn't need to be present to
actually spawn the guest.
Regards,
Anthony Liguori
* Re: [RFC] Expose infrastructure for unpinning guest memory
[not found] ` <470EA136.8020100-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2007-10-11 22:31 ` Izik Eidus
2007-10-12 0:11 ` Dor Laor
1 sibling, 0 replies; 9+ messages in thread
From: Izik Eidus @ 2007-10-11 22:31 UTC (permalink / raw)
To: Anthony Liguori; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, Avi Kivity
[-- Attachment #1: Type: text/plain, Size: 3766 bytes --]
Anthony Liguori wrote:
> Izik Eidus wrote:
>>> static void page_header_update_slot(struct kvm *kvm, void *pte,
>>> gpa_t gpa)
>>> {
>>> int slot = memslot_id(kvm, gfn_to_memslot(kvm, gpa >>
>>> PAGE_SHIFT));
>>>
>>> -------------------------------------------------------------------------
>>>
>>>
>> kvm_memory_slot
>>
>> heh, I am working on a similar patch, and our gfn_to_page and the
>> change to kvm_memory_slot match even down to the variable names :)
>
> Ah, fantastic :-) Care to share what you currently have?
here it is :)
>
>> few things you have to do to make this work:
>> make gfn_to_page an always-safe function (return bad_page in case of
>> failure; I have a patch for this if you want)
>
> That seems pretty obvious. No reason not to have that committed now.
It is included in the patch that I sent you.
>
>> hacking the kvm_read_guest_page / kvm_write_guest_page
>> kvm_clear_guest_page to do put_page after using the page
>
> The idea being that kvm_read_guest_page() will effectively pin the
> page and put_page() has the effect of unpinning it? It seems to me
> that we should be using page_cache_release() since we're not just
> get_page()'ing the memory. I may be wrong though.
>
> Both of these are an optimization though. It's not strictly needed
> for what I'm after since in the case of ballooning, there's no reason
> why someone would be calling kvm_read_guest_page() on the ballooned
> memory.
Ohhh, gfn_to_page does get_page on the pages (get_user_pages does this
automatically); that is the only way the system can make sure the page
won't be swapped out while you are using it, and if we insert a
swapped-out page into the guest, we will have memory corruption...
Therefore each page that we get from gfn_to_page must be put_page()'d
after using it.
To make it easy, gfn_to_page should do get_page even on normal
kernel-allocated pages.
(btw, you have nothing to worry about: if the page is swapped out,
get_user_pages walks the ptes and brings it back in for us.)
>
>>
>> second, is hacking the rmap to do reverse mapping for every present
>> pte and put_page() the pages at rmap_remove()
>> and that is about all, to make this work.
>
> If I understand you correctly, this is to unpin the page whenever it
> is removed from the rmap? That would certainly be useful but it's
> still an optimization. The other obvious optimization to me would be
> to not use get_user_pages() on all memory to start with and instead,
> allow pages to be faulted in on use. This is particularly useful for
> creating a VM with a very large amount of memory, and immediately
> ballooning down. That way the large amount of memory doesn't need to
> be present to actually spawn the guest.
>
We must call get_user_pages, because each page that we don't hold a
reference on (page->_count) can point to a different virtual address at
any moment...
In fact, this way we can remove the memset(...) over all the memory in
kvmctl (I did that only because of the laziness / copy-on-write
mechanism, or however you want to call it, that Linux has), because now
each call to gfn_to_page returns the right virtual address of the
physical guest page.
The patch is attached.
All that is needed to make it work with swapping is running the rmap on
EVERY present page (right now it runs on just writeable pages, which
means that other pages are not protected from being swapped).
You can try silly swapping by removing the put_page from rmap_remove and
from the set_pte_common() function.
btw, I did some ugly things to get you the patch quickly, so I am not
sure it will apply, and some parts might be missing; I am sorry for
this, but I have no time to check it now.
I will clean up the patches when the rmap is ready...
> Regards,
>
> Anthony Liguori
>
>>
>>
>
[-- Attachment #2: swapping.patch --]
[-- Type: text/x-patch, Size: 12887 bytes --]
diff --git a/drivers/kvm/kvm.h b/drivers/kvm/kvm.h
index 4ab487c..e7df8fc 100644
--- a/drivers/kvm/kvm.h
+++ b/drivers/kvm/kvm.h
@@ -409,6 +409,7 @@ struct kvm_memory_slot {
unsigned long *rmap;
unsigned long *dirty_bitmap;
int user_alloc; /* user allocated memory */
+ unsigned long userspace_addr;
};
@@ -561,8 +562,9 @@ static inline int is_error_hpa(hpa_t hpa) { return hpa >> HPA_MSB; }
hpa_t gva_to_hpa(struct kvm_vcpu *vcpu, gva_t gva);
struct page *gva_to_page(struct kvm_vcpu *vcpu, gva_t gva);
-extern hpa_t bad_page_address;
+extern struct page *bad_page;
+int is_error_page(struct page *page);
gfn_t unalias_gfn(struct kvm *kvm, gfn_t gfn);
struct page *gfn_to_page(struct kvm *kvm, gfn_t gfn);
diff --git a/drivers/kvm/kvm_main.c b/drivers/kvm/kvm_main.c
index 0b2894a..dde8497 100644
--- a/drivers/kvm/kvm_main.c
+++ b/drivers/kvm/kvm_main.c
@@ -325,13 +325,13 @@ static void kvm_free_userspace_physmem(struct kvm_memory_slot *free)
{
int i;
- for (i = 0; i < free->npages; ++i) {
+ /*for (i = 0; i < free->npages; ++i) {
if (free->phys_mem[i]) {
if (!PageReserved(free->phys_mem[i]))
SetPageDirty(free->phys_mem[i]);
page_cache_release(free->phys_mem[i]);
}
- }
+ }*/
}
@@ -773,19 +773,8 @@ static int kvm_vm_ioctl_set_memory_region(struct kvm *kvm,
memset(new.phys_mem, 0, npages * sizeof(struct page *));
memset(new.rmap, 0, npages * sizeof(*new.rmap));
if (user_alloc) {
- unsigned long pages_num;
-
new.user_alloc = 1;
- down_read(&current->mm->mmap_sem);
-
- pages_num = get_user_pages(current, current->mm,
- mem->userspace_addr,
- npages, 1, 0, new.phys_mem,
- NULL);
-
- up_read(&current->mm->mmap_sem);
- if (pages_num != npages)
- goto out_unlock;
+ new.userspace_addr = mem->userspace_addr;
} else {
for (i = 0; i < npages; ++i) {
new.phys_mem[i] = alloc_page(GFP_HIGHUSER
@@ -1014,6 +1003,12 @@ static int kvm_vm_ioctl_set_irqchip(struct kvm *kvm, struct kvm_irqchip *chip)
return r;
}
+int is_error_page(struct page *page)
+{
+ return page == bad_page;
+}
+EXPORT_SYMBOL_GPL(is_error_page);
+
gfn_t unalias_gfn(struct kvm *kvm, gfn_t gfn)
{
int i;
@@ -1054,8 +1049,27 @@ struct page *gfn_to_page(struct kvm *kvm, gfn_t gfn)
gfn = unalias_gfn(kvm, gfn);
slot = __gfn_to_memslot(kvm, gfn);
- if (!slot)
- return NULL;
+ if (!slot) {
+ get_page(bad_page);
+ return bad_page;
+ }
+ if (slot->user_alloc) {
+ struct page *page[1];
+ int npages;
+
+ down_read(&current->mm->mmap_sem);
+ npages = get_user_pages(current, current->mm,
+ slot->userspace_addr
+ + (gfn - slot->base_gfn) * PAGE_SIZE, 1,
+ 1, 0, page, NULL);
+ up_read(&current->mm->mmap_sem);
+ if (npages != 1) {
+ get_page(bad_page);
+ return bad_page;
+ }
+ return page[0];
+ }
+ get_page(slot->phys_mem[gfn - slot->base_gfn]);
return slot->phys_mem[gfn - slot->base_gfn];
}
EXPORT_SYMBOL_GPL(gfn_to_page);
@@ -1075,13 +1089,16 @@ int kvm_read_guest_page(struct kvm *kvm, gfn_t gfn, void *data, int offset,
struct page *page;
page = gfn_to_page(kvm, gfn);
- if (!page)
+ if (is_error_page(page)) {
+ put_page(page);
return -EFAULT;
+ }
page_virt = kmap_atomic(page, KM_USER0);
memcpy(data, page_virt + offset, len);
kunmap_atomic(page_virt, KM_USER0);
+ put_page(page);
return 0;
}
EXPORT_SYMBOL_GPL(kvm_read_guest_page);
@@ -1113,14 +1130,17 @@ int kvm_write_guest_page(struct kvm *kvm, gfn_t gfn, const void *data,
struct page *page;
page = gfn_to_page(kvm, gfn);
- if (!page)
+ if (is_error_page(page)) {
+ put_page(page);
return -EFAULT;
+ }
page_virt = kmap_atomic(page, KM_USER0);
memcpy(page_virt + offset, data, len);
kunmap_atomic(page_virt, KM_USER0);
mark_page_dirty(kvm, gfn);
+ put_page(page);
return 0;
}
EXPORT_SYMBOL_GPL(kvm_write_guest_page);
@@ -1151,13 +1171,16 @@ int kvm_clear_guest_page(struct kvm *kvm, gfn_t gfn, int offset, int len)
struct page *page;
page = gfn_to_page(kvm, gfn);
- if (!page)
+ if (is_error_page(page)) {
+ put_page(page);
return -EFAULT;
+ }
page_virt = kmap_atomic(page, KM_USER0);
memset(page_virt + offset, 0, len);
kunmap_atomic(page_virt, KM_USER0);
+ put_page(page);
return 0;
}
EXPORT_SYMBOL_GPL(kvm_clear_guest_page);
@@ -3300,9 +3322,10 @@ static struct page *kvm_vm_nopage(struct vm_area_struct *vma,
pgoff = ((address - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
page = gfn_to_page(kvm, pgoff);
- if (!page)
+ if (is_error_page(page)) {
+ put_page(page);
return NOPAGE_SIGBUS;
- get_page(page);
+ }
if (type != NULL)
*type = VM_FAULT_MINOR;
@@ -3642,7 +3665,7 @@ static struct sys_device kvm_sysdev = {
.cls = &kvm_sysdev_class,
};
-hpa_t bad_page_address;
+struct page *bad_page;
static __init int kvm_init(void)
{
- static struct page *bad_page;
int r;
r = kvm_mmu_module_init();
@@ -3782,16 +3802,11 @@ static __init int kvm_init(void)
kvm_init_msr_list();
- bad_page = alloc_page(GFP_KERNEL);
-
- if (bad_page == NULL) {
+ if ((bad_page = alloc_page(GFP_KERNEL | __GFP_ZERO)) == NULL) {
r = -ENOMEM;
goto out;
}
- bad_page_address = page_to_pfn(bad_page) << PAGE_SHIFT;
- memset(__va(bad_page_address), 0, PAGE_SIZE);
-
return 0;
out:
@@ -3804,9 +3819,12 @@ out4:
static __exit void kvm_exit(void)
{
kvm_exit_debug();
- __free_page(pfn_to_page(bad_page_address >> PAGE_SHIFT));
+ __free_page(bad_page);
kvm_mmu_module_exit();
}
module_init(kvm_init)
module_exit(kvm_exit)
diff --git a/drivers/kvm/mmu.c b/drivers/kvm/mmu.c
index 382bd6a..7669af1 100644
--- a/drivers/kvm/mmu.c
+++ b/drivers/kvm/mmu.c
@@ -421,6 +421,7 @@ static void rmap_remove(struct kvm *kvm, u64 *spte)
return;
page = page_header(__pa(spte));
rmapp = gfn_to_rmap(kvm, page->gfns[spte - page->spt]);
+ put_page(pfn_to_page((*spte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT));
if (!*rmapp) {
printk(KERN_ERR "rmap_remove: %p %llx 0->BUG\n", spte, *spte);
BUG();
@@ -823,23 +824,17 @@ static void page_header_update_slot(struct kvm *kvm, void *pte, gpa_t gpa)
__set_bit(slot, &page_head->slot_bitmap);
}
-hpa_t safe_gpa_to_hpa(struct kvm_vcpu *vcpu, gpa_t gpa)
-{
- hpa_t hpa = gpa_to_hpa(vcpu, gpa);
-
- return is_error_hpa(hpa) ? bad_page_address | (gpa & ~PAGE_MASK): hpa;
-}
-
hpa_t gpa_to_hpa(struct kvm_vcpu *vcpu, gpa_t gpa)
{
struct page *page;
+ hpa_t hpa;
ASSERT((gpa & HPA_ERR_MASK) == 0);
page = gfn_to_page(vcpu->kvm, gpa >> PAGE_SHIFT);
- if (!page)
- return gpa | HPA_ERR_MASK;
- return ((hpa_t)page_to_pfn(page) << PAGE_SHIFT)
- | (gpa & (PAGE_SIZE-1));
+ hpa = ((hpa_t)page_to_pfn(page) << PAGE_SHIFT) | (gpa & (PAGE_SIZE-1));
+ if (is_error_page(page))
+ return hpa | HPA_ERR_MASK;
+ return hpa;
}
hpa_t gva_to_hpa(struct kvm_vcpu *vcpu, gva_t gva)
@@ -886,6 +881,9 @@ static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, hpa_t p)
table[index] = p | PT_PRESENT_MASK | PT_WRITABLE_MASK |
PT_USER_MASK;
rmap_add(vcpu, &table[index], v >> PAGE_SHIFT);
+ if(!is_rmap_pte(table[index]))
+ put_page(pfn_to_page((v & PT64_BASE_ADDR_MASK)
+ >> PAGE_SHIFT));
return 0;
}
@@ -900,6 +898,8 @@ static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, hpa_t p)
1, 0, &table[index]);
if (!new_table) {
pgprintk("nonpaging_map: ENOMEM\n");
+ put_page(pfn_to_page((v & PT64_BASE_ADDR_MASK)
+ >> PAGE_SHIFT));
return -ENOMEM;
}
@@ -1014,8 +1014,11 @@ static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gva_t gva,
paddr = gpa_to_hpa(vcpu , addr & PT64_BASE_ADDR_MASK);
- if (is_error_hpa(paddr))
+ if (is_error_hpa(paddr)) {
+ put_page(pfn_to_page((paddr & PT64_BASE_ADDR_MASK)
+ >> PAGE_SHIFT));
return 1;
+ }
return nonpaging_map(vcpu, addr & PAGE_MASK, paddr);
}
@@ -1489,8 +1492,7 @@ static void audit_mappings_page(struct kvm_vcpu *vcpu, u64 page_pte,
printk(KERN_ERR "xx audit error: (%s) levels %d"
" gva %lx gpa %llx hpa %llx ent %llx %d\n",
audit_msg, vcpu->mmu.root_level,
- va, gpa, hpa, ent,
- is_shadow_present_pte(ent));
+ va, gpa, hpa, ent, is_shadow_present_pte(ent));
else if (ent == shadow_notrap_nonpresent_pte
&& !is_error_hpa(hpa))
printk(KERN_ERR "audit: (%s) notrap shadow,"
diff --git a/drivers/kvm/paging_tmpl.h b/drivers/kvm/paging_tmpl.h
index 447d2c3..da02742 100644
--- a/drivers/kvm/paging_tmpl.h
+++ b/drivers/kvm/paging_tmpl.h
@@ -76,8 +76,6 @@ static int FNAME(walk_addr)(struct guest_walker *walker,
struct kvm_vcpu *vcpu, gva_t addr,
int write_fault, int user_fault, int fetch_fault)
{
- hpa_t hpa;
- struct kvm_memory_slot *slot;
pt_element_t *ptep;
pt_element_t root;
gfn_t table_gfn;
@@ -102,9 +100,8 @@ static int FNAME(walk_addr)(struct guest_walker *walker,
walker->table_gfn[walker->level - 1] = table_gfn;
pgprintk("%s: table_gfn[%d] %lx\n", __FUNCTION__,
walker->level - 1, table_gfn);
- slot = gfn_to_memslot(vcpu->kvm, table_gfn);
- hpa = safe_gpa_to_hpa(vcpu, root & PT64_BASE_ADDR_MASK);
- walker->page = pfn_to_page(hpa >> PAGE_SHIFT);
+ walker->page = gfn_to_page(vcpu->kvm, (root & PT64_BASE_ADDR_MASK)
+ >> PAGE_SHIFT);
walker->table = kmap_atomic(walker->page, KM_USER0);
ASSERT((!is_long_mode(vcpu) && is_pae(vcpu)) ||
@@ -114,7 +111,6 @@ static int FNAME(walk_addr)(struct guest_walker *walker,
for (;;) {
int index = PT_INDEX(addr, walker->level);
- hpa_t paddr;
ptep = &walker->table[index];
walker->index = index;
@@ -159,17 +155,19 @@ static int FNAME(walk_addr)(struct guest_walker *walker,
walker->inherited_ar &= walker->table[index];
table_gfn = (*ptep & PT_BASE_ADDR_MASK) >> PAGE_SHIFT;
kunmap_atomic(walker->table, KM_USER0);
- paddr = safe_gpa_to_hpa(vcpu, table_gfn << PAGE_SHIFT);
- walker->page = pfn_to_page(paddr >> PAGE_SHIFT);
+ put_page(walker->page);
+ walker->page = gfn_to_page(vcpu->kvm, table_gfn);
walker->table = kmap_atomic(walker->page, KM_USER0);
--walker->level;
- walker->table_gfn[walker->level - 1] = table_gfn;
+ walker->table_gfn[walker->level - 1 ] = table_gfn;
pgprintk("%s: table_gfn[%d] %lx\n", __FUNCTION__,
walker->level - 1, table_gfn);
}
walker->pte = *ptep;
- if (walker->page)
+ if (walker->page) {
walker->ptep = NULL;
+ put_page(walker->page);
+ }
if (walker->table)
kunmap_atomic(walker->table, KM_USER0);
pgprintk("%s: pte %llx\n", __FUNCTION__, (u64)*ptep);
@@ -191,6 +189,8 @@ err:
walker->error_code |= PFERR_FETCH_MASK;
if (walker->table)
kunmap_atomic(walker->table, KM_USER0);
+ if (walker->page)
+ put_page(walker->page);
return 0;
}
@@ -222,18 +222,23 @@ static void FNAME(set_pte_common)(struct kvm_vcpu *vcpu,
write_fault, user_fault, gfn);
if (write_fault && !dirty) {
+ struct page *page = NULL;
+
pt_element_t *guest_ent, *tmp = NULL;
if (walker->ptep)
guest_ent = walker->ptep;
else {
- tmp = kmap_atomic(walker->page, KM_USER0);
+ page = gfn_to_page(vcpu->kvm, walker->table_gfn[walker->level - 1]);
+ tmp = kmap_atomic(page, KM_USER0);
guest_ent = &tmp[walker->index];
}
*guest_ent |= PT_DIRTY_MASK;
- if (!walker->ptep)
+ if (!walker->ptep) {
kunmap_atomic(tmp, KM_USER0);
+ put_page(page);
+ }
dirty = 1;
FNAME(mark_pagetable_dirty)(vcpu->kvm, walker);
}
@@ -257,6 +262,8 @@ static void FNAME(set_pte_common)(struct kvm_vcpu *vcpu,
if (is_error_hpa(paddr)) {
set_shadow_pte(shadow_pte,
shadow_trap_nonpresent_pte | PT_SHADOW_IO_MARK);
+ put_page(pfn_to_page((paddr & PT64_BASE_ADDR_MASK)
+ >> PAGE_SHIFT));
return;
}
@@ -294,9 +301,16 @@ unshadowed:
pgprintk("%s: setting spte %llx\n", __FUNCTION__, spte);
set_shadow_pte(shadow_pte, spte);
page_header_update_slot(vcpu->kvm, shadow_pte, gaddr);
- if (!was_rmapped)
+ if (!was_rmapped) {
rmap_add(vcpu, shadow_pte, (gaddr & PT64_BASE_ADDR_MASK)
>> PAGE_SHIFT);
+ if (!is_rmap_pte(*shadow_pte))
+ put_page(pfn_to_page((paddr & PT64_BASE_ADDR_MASK)
+ >> PAGE_SHIFT));
+ }
+ else
+ put_page(pfn_to_page((paddr & PT64_BASE_ADDR_MASK)
+ >> PAGE_SHIFT));
if (!ptwrite || !*ptwrite)
vcpu->last_pte_updated = shadow_pte;
}
@@ -518,19 +532,22 @@ static void FNAME(prefetch_page)(struct kvm_vcpu *vcpu,
{
int i;
pt_element_t *gpt;
+ struct page *page;
if (sp->role.metaphysical || PTTYPE == 32) {
nonpaging_prefetch_page(vcpu, sp);
return;
}
- gpt = kmap_atomic(gfn_to_page(vcpu->kvm, sp->gfn), KM_USER0);
+ page = gfn_to_page(vcpu->kvm, sp->gfn);
+ gpt = kmap_atomic(page, KM_USER0);
for (i = 0; i < PT64_ENT_PER_PAGE; ++i)
if (is_present_pte(gpt[i]))
sp->spt[i] = shadow_trap_nonpresent_pte;
else
sp->spt[i] = shadow_notrap_nonpresent_pte;
kunmap_atomic(gpt, KM_USER0);
+ put_page(page);
}
[-- Attachment #4: Type: text/plain, Size: 186 bytes --]
_______________________________________________
kvm-devel mailing list
kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org
https://lists.sourceforge.net/lists/listinfo/kvm-devel
* Re: [RFC] Expose infrastructure for unpinning guest memory
[not found] ` <470EA136.8020100-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2007-10-11 22:31 ` Izik Eidus
@ 2007-10-12 0:11 ` Dor Laor
1 sibling, 0 replies; 9+ messages in thread
From: Dor Laor @ 2007-10-12 0:11 UTC (permalink / raw)
To: Anthony Liguori; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, Avi Kivity
[-- Attachment #1: Type: text/plain, Size: 2265 bytes --]
> The idea being that kvm_read_guest_page() will effectively pin the page
> and put_page() has the effect of unpinning it? It seems to me that we
> should be using page_cache_release() since we're not just
> get_page()'ing the memory. I may be wrong though.
>
> Both of these are an optimization though. It's not strictly needed for
> what I'm after since in the case of ballooning, there's no reason why
> someone would be calling kvm_read_guest_page() on the ballooned memory.
>
>
>> second, is hacking the rmap to do reverse mapping for every present
>> pte and put_page() the pages at rmap_remove()
>> and that is about all, to make this work.
>>
>
> If I understand you correctly, this is to unpin the page whenever it is
> removed from the rmap? That would certainly be useful but it's still an
> optimization. The other obvious optimization to me would be to not use
> get_user_pages() on all memory to start with and instead, allow pages to
> be faulted in on use. This is particularly useful for creating a VM
> with a very large amount of memory, and immediately ballooning down.
> That way the large amount of memory doesn't need to be present to
> actually spawn the guest.
>
> Regards,
>
> Anthony Liguori
>
>
Izik's idea is aimed at a general guest swapping capability. The first
step is just to increase the reference count of the rmapped pages. The
second is to change the size of the shadow page tables as a function of
guest memory usage, and the third is to get notifications from Linux
about pte state changes.
btw: I have unmerged balloon code (guest & host) from the old kernel
mapping scheme.
The guest part may still be valid for the userspace allocation.
Attaching it.
Dor.
>>
>
>
> -------------------------------------------------------------------------
> This SF.net email is sponsored by: Splunk Inc.
> Still grepping through log files to find problems? Stop.
> Now Search log events and configuration files using AJAX and a browser.
> Download your FREE copy of Splunk now >> http://get.splunk.com/
> _______________________________________________
> kvm-devel mailing list
> kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org
> https://lists.sourceforge.net/lists/listinfo/kvm-devel
>
>
[-- Attachment #2: kvm_balloon.c --]
[-- Type: text/plain, Size: 7151 bytes --]
/*
* KVM guest balloon driver
*
* Copyright (C) 2007, Qumranet, Inc., Dor Laor <dor.laor-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
*
* This work is licensed under the terms of the GNU GPL, version 2. See
* the COPYING file in the top-level directory.
*/
#include "../kvm.h"
#include <linux/kvm_para.h>
#include <linux/kvm.h>
#include <asm/hypercall.h>
#include <asm/uaccess.h>
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/percpu.h>
#include <linux/init.h>
#include <linux/mm.h>
#include <linux/version.h>
#include <linux/miscdevice.h>
MODULE_AUTHOR("Dor Laor");
MODULE_DESCRIPTION("Implements guest ballooning support");
MODULE_LICENSE("GPL");
MODULE_VERSION("1");
#define KVM_BALLOON_MINOR MISC_DYNAMIC_MINOR
static LIST_HEAD(balloon_plist);
static int balloon_size = 0;
static DEFINE_SPINLOCK(balloon_plist_lock);
static gfn_t balloon_shared_gfn;
struct balloon_page {
	struct page *bpage;
	struct list_head bp_list;
};

static int kvm_trigger_balloon_op(int npages)
{
	unsigned long ret;

	ret = kvm_hypercall2(__NR_hypercall_balloon, balloon_shared_gfn, npages);
	WARN_ON(ret);
	printk(KERN_DEBUG "%s: hypercall ret: %lx\n", __FUNCTION__, ret);

	return ret;
}

static int kvm_balloon_inflate(unsigned long *shared_page_addr, int npages)
{
	LIST_HEAD(tmp_list);
	struct balloon_page *node, *tmp;
	u32 *pfn = (u32 *)shared_page_addr;
	int allocated = 0;
	int i, r = -ENOMEM;

	for (i = 0; i < npages; i++) {
		node = kzalloc(sizeof(struct balloon_page), GFP_KERNEL);
		if (!node)
			goto out_free;
		node->bpage = alloc_page(GFP_HIGHUSER | __GFP_ZERO);
		if (!node->bpage) {
			kfree(node);
			goto out_free;
		}
		list_add(&node->bp_list, &tmp_list);
		allocated++;
		*pfn++ = page_to_pfn(node->bpage);
	}

	spin_lock(&balloon_plist_lock);
	r = kvm_trigger_balloon_op(npages);
	if (r < 0) {
		printk(KERN_DEBUG "%s: got kvm_trigger_balloon_op res=%d\n",
		       __FUNCTION__, r);
		spin_unlock(&balloon_plist_lock);
		goto out_free;
	}
	list_splice(&tmp_list, &balloon_plist);
	balloon_size += allocated;
	printk(KERN_DEBUG "%s: current balloon size=%d\n", __FUNCTION__,
	       balloon_size);
	spin_unlock(&balloon_plist_lock);

	return allocated;

out_free:
	list_for_each_entry_safe(node, tmp, &tmp_list, bp_list) {
		__free_page(node->bpage);
		list_del(&node->bp_list);
		kfree(node);
	}
	return r;
}

static int kvm_balloon_deflate(unsigned long *shared_page_addr, int npages)
{
	LIST_HEAD(tmp_list);
	struct balloon_page *node, *tmp;
	u32 *pfn = (u32 *)shared_page_addr;
	int deallocated = 0;
	int r = 0;

	spin_lock(&balloon_plist_lock);
	if (balloon_size < npages) {
		printk(KERN_DEBUG "%s: error balloon=%d while deflate rq=%d\n",
		       __FUNCTION__, balloon_size, npages);
		npages = balloon_size;
	}

	/*
	 * Move the balloon pages to tmp list before issuing
	 * the hypercall
	 */
	list_for_each_entry_safe(node, tmp, &balloon_plist, bp_list) {
		*pfn++ = page_to_pfn(node->bpage);
		list_move(&node->bp_list, &tmp_list);
		if (++deallocated == npages)
			break;
	}

	r = kvm_trigger_balloon_op(-npages);
	if (r < 0) {
		printk(KERN_DEBUG "%s: got kvm_trigger_balloon_op res=%d\n",
		       __FUNCTION__, r);
		goto out;
	}

	list_for_each_entry_safe(node, tmp, &tmp_list, bp_list) {
		__free_page(node->bpage);
		list_del(&node->bp_list);
		kfree(node);
	}
	balloon_size -= npages;
	printk(KERN_DEBUG "%s: current balloon size=%d\n", __FUNCTION__,
	       balloon_size);
	spin_unlock(&balloon_plist_lock);

	return deallocated;

out:
	npages = 0;
	list_splice(&tmp_list, &balloon_plist);
	spin_unlock(&balloon_plist_lock);
	return r;
}

#define MAX_BALLOON_PAGES_PER_OP (PAGE_SIZE/sizeof(u32))
#define MAX_BALLOON_XFLATE_OP 1000000
static int kvm_balloon_xflate(struct kvm_balloon_op *balloon_op)
{
	unsigned long *shared_page_addr;
	int r = -EINVAL, i;
	int iterations;
	int npages;
	int curr_pages = 0;
	int gfns_per_page;

	if (balloon_op->npages < -MAX_BALLOON_XFLATE_OP ||
	    balloon_op->npages > MAX_BALLOON_XFLATE_OP ||
	    !balloon_op->npages) {
		printk(KERN_DEBUG "%s: got bad npages=%d\n",
		       __FUNCTION__, balloon_op->npages);
		return -EINVAL;
	}

	npages = abs(balloon_op->npages);
	printk(KERN_DEBUG "%s: got %s, npages=%d\n", __FUNCTION__,
	       (balloon_op->npages > 0) ? "inflate" : "deflate", npages);

	gfns_per_page = MAX_BALLOON_PAGES_PER_OP;
	shared_page_addr = __va(balloon_shared_gfn << PAGE_SHIFT);

	/*
	 * Call the balloon hypercall in PAGE_SIZE*pfns-per-page
	 * iterations
	 */
	iterations = DIV_ROUND_UP(npages, gfns_per_page);
	printk(KERN_DEBUG "%s: iterations=%d\n", __FUNCTION__, iterations);

	for (i = 0; i < iterations; i++) {
		int pages_in_iteration =
			min(npages - curr_pages, gfns_per_page);

		if (balloon_op->npages > 0)
			r = kvm_balloon_inflate(shared_page_addr,
						pages_in_iteration);
		else
			r = kvm_balloon_deflate(shared_page_addr,
						pages_in_iteration);
		if (r < 0)
			return r;
		curr_pages += r;
		if (r != pages_in_iteration)
			break;
	}

	return curr_pages;
}

static long kvm_balloon_ioctl(struct file *filp,
			      unsigned int ioctl, unsigned long arg)
{
	int r = -EINVAL;
	void __user *argp = (void __user *)arg;

	switch (ioctl) {
	case KVM_BALLOON_OP: {
		struct kvm_balloon_op balloon_op;

		r = -EFAULT;
		if (copy_from_user(&balloon_op, argp, sizeof balloon_op))
			goto out;
		r = kvm_balloon_xflate(&balloon_op);
		if (r < 0)
			goto out;
		balloon_op.npages = r;
		r = -EFAULT;
		if (copy_to_user(argp, &balloon_op, sizeof balloon_op))
			goto out;
		r = 0;
		break;
	}
	default:
		break;
	}
out:
	return r;
}

static int kvm_balloon_open(struct inode *inode, struct file *filp)
{
	return 0;
}

static int kvm_balloon_release(struct inode *inode, struct file *filp)
{
	return 0;
}

static struct file_operations balloon_chardev_ops = {
	.open		= kvm_balloon_open,
	.release	= kvm_balloon_release,
	.unlocked_ioctl	= kvm_balloon_ioctl,
	.compat_ioctl	= kvm_balloon_ioctl,
};

static struct miscdevice kvm_balloon_dev = {
	KVM_BALLOON_MINOR,
	"kvm_balloon",
	&balloon_chardev_ops,
};

static int __init kvm_balloon_init(void)
{
	struct page *gfns_page;
	int r = 0;

	balloon_chardev_ops.owner = THIS_MODULE;
	if (misc_register(&kvm_balloon_dev)) {
		printk(KERN_ERR "balloon: misc device register failed\n");
		return -EBUSY;
	}

	gfns_page = alloc_page(GFP_KERNEL);
	if (gfns_page == NULL) {
		r = -ENOMEM;
		goto out;
	}
	balloon_shared_gfn = page_to_pfn(gfns_page);

	return 0;

out:
	misc_deregister(&kvm_balloon_dev);
	return r;
}

static void __exit kvm_balloon_exit(void)
{
	misc_deregister(&kvm_balloon_dev);

	spin_lock(&balloon_plist_lock);
	/*
	 * Don't free the balloon pages here: KVM has revoked access
	 * to them, so they are intentionally leaked.
	 *
	 * {struct balloon_page *node, *tmp;
	 *  list_for_each_entry_safe(node, tmp, &balloon_plist, bp_list) {
	 *	__free_page(node->bpage);
	 *	list_del(&node->bp_list);
	 *	kfree(node);
	 *  }}
	 */
	if (balloon_size)
		printk(KERN_ERR "%s: exit while balloon not empty!\n",
		       __FUNCTION__);
	spin_unlock(&balloon_plist_lock);

	__free_page(pfn_to_page(balloon_shared_gfn));
}

module_init(kvm_balloon_init);
module_exit(kvm_balloon_exit);
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [RFC] Expose infrastructure for unpinning guest memory
[not found] ` <1192138344500-git-send-email-aliguori-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2007-10-11 21:59 ` Izik Eidus
@ 2007-10-12 6:32 ` Avi Kivity
[not found] ` <470F14F2.7050800-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
1 sibling, 1 reply; 9+ messages in thread
From: Avi Kivity @ 2007-10-12 6:32 UTC (permalink / raw)
To: Anthony Liguori; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f
Anthony Liguori wrote:
> Now that we have userspace memory allocation, I wanted to play with ballooning.
> The idea is that when a guest "balloons" down, we simply unpin the underlying
> physical memory and the host kernel may or may not swap it. To reclaim
> ballooned memory, the guest can just start using it and we'll pin it on demand.
>
> The following patch is a stab at providing the right infrastructure for pinning
> and automatic repinning. I don't have a lot of comfort in the MMU code so I
> thought I'd get some feedback before going much further.
>
> gpa_to_hpa is a little awkward to hook, but it seems like the right place in the
> code. I'm most uncertain about the SMP safety of the unpinning. Presumably,
> I have to hold the kvm lock around the mmu_unshadow and page_cache release to
> ensure that another VCPU doesn't fault the page back in after mmu_unshadow?
>
>
Once we have true swapping capabilities (which imply the ability for the
kernel to remove a page from the shadow page tables) you can unpin by
calling munmap() or madvise(MADV_REMOVE) on the pages to be unpinned.
Other than that the approach seems right.
--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [RFC] Expose infrastructure for unpinning guest memory
[not found] ` <470F14F2.7050800-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
@ 2007-10-12 18:52 ` Anthony Liguori
[not found] ` <470FC25A.70607-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
0 siblings, 1 reply; 9+ messages in thread
From: Anthony Liguori @ 2007-10-12 18:52 UTC (permalink / raw)
To: Avi Kivity; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f
Avi Kivity wrote:
> Anthony Liguori wrote:
>
>> Now that we have userspace memory allocation, I wanted to play with ballooning.
>> The idea is that when a guest "balloons" down, we simply unpin the underlying
>> physical memory and the host kernel may or may not swap it. To reclaim
>> ballooned memory, the guest can just start using it and we'll pin it on demand.
>>
>> The following patch is a stab at providing the right infrastructure for pinning
>> and automatic repinning. I don't have a lot of comfort in the MMU code so I
>> thought I'd get some feedback before going much further.
>>
>> gpa_to_hpa is a little awkward to hook, but it seems like the right place in the
>> code. I'm most uncertain about the SMP safety of the unpinning. Presumably,
>> I have to hold the kvm lock around the mmu_unshadow and page_cache release to
>> ensure that another VCPU doesn't fault the page back in after mmu_unshadow?
>>
>>
>>
>
> Once we have true swapping capabilities (which imply the ability for the
> kernel to remove a page from the shadow page tables) you can unpin by
> calling munmap() or madvise(MADV_REMOVE) on the pages to be unpinned.
>
So does MADV_REMOVE remove the backing page but still allow for memory
to be faulted in? That is, after calling MADV_REMOVE, there's no
guarantee that the contents of a given VA range will remain the same (but
it won't SEGV the app if it accesses that memory)?
If so, I think that would be the right way to treat it. That allows for
two types of hints for the guest to provide: 1) I won't access this
memory for a very long time (so it's a good candidate to swap out) and
2) I won't access this memory and don't care about its contents.
Regards,
Anthony Liguori
> Other than that the approach seems right.
>
>
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [RFC] Expose infrastructure for unpinning guest memory
[not found] ` <470FC25A.70607-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2007-10-13 7:21 ` Avi Kivity
2007-10-15 8:10 ` Carsten Otte
1 sibling, 0 replies; 9+ messages in thread
From: Avi Kivity @ 2007-10-13 7:21 UTC (permalink / raw)
To: Anthony Liguori; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f
Anthony Liguori wrote:
> Avi Kivity wrote:
>
>> Anthony Liguori wrote:
>>
>>
>>> Now that we have userspace memory allocation, I wanted to play with ballooning.
>>> The idea is that when a guest "balloons" down, we simply unpin the underlying
>>> physical memory and the host kernel may or may not swap it. To reclaim
>>> ballooned memory, the guest can just start using it and we'll pin it on demand.
>>>
>>> The following patch is a stab at providing the right infrastructure for pinning
>>> and automatic repinning. I don't have a lot of comfort in the MMU code so I
>>> thought I'd get some feedback before going much further.
>>>
>>> gpa_to_hpa is a little awkward to hook, but it seems like the right place in the
>>> code. I'm most uncertain about the SMP safety of the unpinning. Presumably,
>>> I have to hold the kvm lock around the mmu_unshadow and page_cache release to
>>> ensure that another VCPU doesn't fault the page back in after mmu_unshadow?
>>>
>>>
>>>
>>>
>> Once we have true swapping capabilities (which imply the ability for the
>> kernel to remove a page from the shadow page tables) you can unpin by
>> calling munmap() or madvise(MADV_REMOVE) on the pages to be unpinned.
>>
>>
>
> So does MADV_REMOVE remove the backing page but still allow for memory
> to be faulted in? That is, after calling MADV_REMOVE, there's no
> guarantee that the contents of a given VA range will remain the same (but
> it won't SEGV the app if it accesses that memory)?
>
>
I think so. The docs aren't clear. See also MADV_DONTNEED.
--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [RFC] Expose infrastructure for unpinning guest memory
[not found] ` <470FC25A.70607-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2007-10-13 7:21 ` Avi Kivity
@ 2007-10-15 8:10 ` Carsten Otte
1 sibling, 0 replies; 9+ messages in thread
From: Carsten Otte @ 2007-10-15 8:10 UTC (permalink / raw)
To: Anthony Liguori; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, Avi Kivity
Anthony Liguori wrote:
> So does MADV_REMOVE remove the backing page but still allow for memory
> to be faulted in? That is, after calling MADV_REMOVE, there's no
> guarantee that the contents of a given VA range will remain the same (but
> it won't SEGV the app if it accesses that memory)?
>
> If so, I think that would be the right way to treat it. That allows for
> two types of hints for the guest to provide: 1) I won't access this
> memory for a very long time (so it's a good candidate to swap out) and
> 2) I won't access this memory and don't care about its contents.
You really want MADV_DONTNEED. It does what one would expect: tell the
kernel you'd prefer to see the memory discarded, but it remains mapped so
that you can fault it back in. My xip code got into conflict with this
kernel feature once, which is why I had to look into what it does.
^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2007-10-15 8:10 UTC | newest]
Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-10-11 21:32 [RFC] Expose infrastructure for unpinning guest memory Anthony Liguori
[not found] ` <1192138344500-git-send-email-aliguori-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2007-10-11 21:59 ` Izik Eidus
[not found] ` <470E9CB6.4030107-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
2007-10-11 22:18 ` Anthony Liguori
[not found] ` <470EA136.8020100-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2007-10-11 22:31 ` Izik Eidus
2007-10-12 0:11 ` Dor Laor
2007-10-12 6:32 ` Avi Kivity
[not found] ` <470F14F2.7050800-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
2007-10-12 18:52 ` Anthony Liguori
[not found] ` <470FC25A.70607-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2007-10-13 7:21 ` Avi Kivity
2007-10-15 8:10 ` Carsten Otte
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox