* [RFC]kvm: swapout guest page
@ 2007-05-21 8:12 Shaohua Li
From: Shaohua Li @ 2007-05-21 8:12 UTC (permalink / raw)
To: kvm-devel
[-- Attachment #1: Type: text/plain, Size: 1404 bytes --]
Hi,
I saw some discussions on this topic but no progress, so I did an
experiment to make guest pages be allocated dynamically and swapped out;
please see the attached patches. They are not yet ready for merging, but
I'd like to get some suggestions and help. The patches (against kvm-19)
work here but may not be very stable, as there are likely some locking
issues on the swapout path, which I'll check further later. If you are
brave, please try them :). Some issues I have:
1. There is a spinlock protecting the kvm struct, and we can't sleep
while holding it. A possible solution is a 'release lock, sleep and
retry' scheme, but the shadow page fault path is not easy to convert to
that. The spinlock also prevents the vcpu from being migrated to other
CPUs, since VMX operations must be done on the CPU the vcpu runs on. I
changed it to a mutex plus a CPU-affinity setting. It's a little hacky;
I'd like to see if there are better approaches.
2. Linux page reclaim can't tell whether a guest page is referenced
often. My current patch just blindly adds guest pages to the LRU; this
is not optimized.
3. kvm_ops.tlb_flush should really send an IPI to make the vcpu flush
its TLB, as it might be called on a CPU other than the one the vcpu runs
on. This means the swapout path cannot zap shadow page tables, so my
patch just skips any guest page that a shadow page table points to. I
assume KVM SMP guest support will improve tlb_flush.
Please cc me on any reply, as I am not subscribed to the list.
Thanks,
Shaohua
[-- Attachment #2: export-symbol.patch --]
[-- Type: text/x-patch, Size: 2109 bytes --]
Export the symbols that swapout requires.
Index: 2.6.21-rc7/mm/swap_state.c
===================================================================
--- 2.6.21-rc7.orig/mm/swap_state.c 2007-04-24 02:20:00.000000000 +0800
+++ 2.6.21-rc7/mm/swap_state.c 2007-05-21 10:10:20.000000000 +0800
@@ -207,6 +207,7 @@ void delete_from_swap_cache(struct page
swap_free(entry);
page_cache_release(page);
}
+EXPORT_SYMBOL(delete_from_swap_cache);
/*
* Strange swizzling function only for use by shmem_writepage
@@ -225,6 +226,7 @@ int move_to_swap_cache(struct page *page
INC_CACHE_INFO(exist_race);
return err;
}
+EXPORT_SYMBOL(move_to_swap_cache);
/*
* Strange swizzling function for shmem_getpage (and shmem_unuse)
@@ -307,6 +309,7 @@ struct page * lookup_swap_cache(swp_entr
INC_CACHE_INFO(find_total);
return page;
}
+EXPORT_SYMBOL(lookup_swap_cache);
/*
* Locate a page of swap in physical memory, reserving swap cache space
@@ -364,3 +367,4 @@ struct page *read_swap_cache_async(swp_e
page_cache_release(new_page);
return found_page;
}
+EXPORT_SYMBOL(read_swap_cache_async);
Index: 2.6.21-rc7/mm/swapfile.c
===================================================================
--- 2.6.21-rc7.orig/mm/swapfile.c 2007-04-24 02:20:00.000000000 +0800
+++ 2.6.21-rc7/mm/swapfile.c 2007-05-21 10:10:20.000000000 +0800
@@ -211,6 +211,7 @@ noswap:
spin_unlock(&swap_lock);
return (swp_entry_t) {0};
}
+EXPORT_SYMBOL(get_swap_page);
swp_entry_t get_swap_page_of_type(int type)
{
@@ -303,6 +304,7 @@ void swap_free(swp_entry_t entry)
spin_unlock(&swap_lock);
}
}
+EXPORT_SYMBOL(swap_free);
/*
* How many references to page are currently swapped out?
Index: 2.6.21-rc7/mm/filemap.c
===================================================================
--- 2.6.21-rc7.orig/mm/filemap.c 2007-04-24 02:20:00.000000000 +0800
+++ 2.6.21-rc7/mm/filemap.c 2007-05-21 10:11:09.000000000 +0800
@@ -465,6 +465,7 @@ int add_to_page_cache_lru(struct page *p
lru_cache_add(page);
return ret;
}
+EXPORT_SYMBOL(add_to_page_cache_lru);
#ifdef CONFIG_NUMA
struct page *__page_cache_alloc(gfp_t gfp)
[-- Attachment #3: mutex.patch --]
[-- Type: text/x-patch, Size: 11272 bytes --]
Against kvm-19.
The kvm lock is a spinlock; change it to a mutex so that we can sleep on some paths.
The KVM page fault path is not easy to convert to a 'release lock and retry' scheme.
Index: kvm/kernel/kvm.h
===================================================================
--- kvm.orig/kernel/kvm.h 2007-04-16 21:13:52.000000000 +0800
+++ kvm/kernel/kvm.h 2007-05-21 09:12:23.000000000 +0800
@@ -238,6 +238,7 @@
struct kvm_vcpu {
struct kvm *kvm;
+ cpumask_t saved_mask;
union {
struct vmcs *vmcs;
struct vcpu_svm *svm;
@@ -328,7 +329,7 @@
};
struct kvm {
- spinlock_t lock; /* protects everything except vcpus */
+ struct mutex lock; /* protects everything except vcpus */
int naliases;
struct kvm_mem_alias aliases[KVM_ALIAS_SLOTS];
int nmemslots;
Index: kvm/kernel/kvm_main.c
===================================================================
--- kvm.orig/kernel/kvm_main.c 2007-04-16 21:13:52.000000000 +0800
+++ kvm/kernel/kvm_main.c 2007-05-21 09:16:32.000000000 +0800
@@ -292,7 +292,7 @@
if (!kvm)
return ERR_PTR(-ENOMEM);
- spin_lock_init(&kvm->lock);
+ mutex_init(&kvm->lock);
INIT_LIST_HEAD(&kvm->active_mmu_pages);
for (i = 0; i < KVM_MAX_VCPUS; ++i) {
struct kvm_vcpu *vcpu = &kvm->vcpus[i];
@@ -422,7 +422,7 @@
int ret;
struct page *page;
- spin_lock(&vcpu->kvm->lock);
+ mutex_lock(&vcpu->kvm->lock);
page = gfn_to_page(vcpu->kvm, pdpt_gfn);
/* FIXME: !page - emulate? 0xff? */
pdpt = kmap_atomic(page, KM_USER0);
@@ -441,7 +441,7 @@
out:
kunmap_atomic(pdpt, KM_USER0);
- spin_unlock(&vcpu->kvm->lock);
+ mutex_unlock(&vcpu->kvm->lock);
return ret;
}
@@ -501,9 +501,9 @@
kvm_arch_ops->set_cr0(vcpu, cr0);
vcpu->cr0 = cr0;
- spin_lock(&vcpu->kvm->lock);
+ mutex_lock(&vcpu->kvm->lock);
kvm_mmu_reset_context(vcpu);
- spin_unlock(&vcpu->kvm->lock);
+ mutex_unlock(&vcpu->kvm->lock);
return;
}
EXPORT_SYMBOL_GPL(set_cr0);
@@ -542,9 +542,9 @@
return;
}
kvm_arch_ops->set_cr4(vcpu, cr4);
- spin_lock(&vcpu->kvm->lock);
+ mutex_lock(&vcpu->kvm->lock);
kvm_mmu_reset_context(vcpu);
- spin_unlock(&vcpu->kvm->lock);
+ mutex_unlock(&vcpu->kvm->lock);
}
EXPORT_SYMBOL_GPL(set_cr4);
@@ -572,7 +572,7 @@
}
vcpu->cr3 = cr3;
- spin_lock(&vcpu->kvm->lock);
+ mutex_lock(&vcpu->kvm->lock);
/*
* Does the new cr3 value map to physical memory? (Note, we
* catch an invalid cr3 even in real-mode, because it would
@@ -586,7 +586,7 @@
inject_gp(vcpu);
else
vcpu->mmu.new_cr3(vcpu);
- spin_unlock(&vcpu->kvm->lock);
+ mutex_unlock(&vcpu->kvm->lock);
}
EXPORT_SYMBOL_GPL(set_cr3);
@@ -629,9 +629,9 @@
static void do_remove_write_access(struct kvm_vcpu *vcpu, int slot)
{
- spin_lock(&vcpu->kvm->lock);
+ mutex_lock(&vcpu->kvm->lock);
kvm_mmu_slot_remove_write_access(vcpu, slot);
- spin_unlock(&vcpu->kvm->lock);
+ mutex_unlock(&vcpu->kvm->lock);
}
/*
@@ -670,7 +670,7 @@
mem->flags &= ~KVM_MEM_LOG_DIRTY_PAGES;
raced:
- spin_lock(&kvm->lock);
+ mutex_lock(&kvm->lock);
memory_config_version = kvm->memory_config_version;
new = old = *memslot;
@@ -699,7 +699,7 @@
* Do memory allocations outside lock. memory_config_version will
* detect any races.
*/
- spin_unlock(&kvm->lock);
+ mutex_unlock(&kvm->lock);
/* Deallocate if slot is being removed */
if (!npages)
@@ -738,10 +738,10 @@
memset(new.dirty_bitmap, 0, dirty_bytes);
}
- spin_lock(&kvm->lock);
+ mutex_lock(&kvm->lock);
if (memory_config_version != kvm->memory_config_version) {
- spin_unlock(&kvm->lock);
+ mutex_unlock(&kvm->lock);
kvm_free_physmem_slot(&new, &old);
goto raced;
}
@@ -756,7 +756,7 @@
*memslot = new;
++kvm->memory_config_version;
- spin_unlock(&kvm->lock);
+ mutex_unlock(&kvm->lock);
for (i = 0; i < KVM_MAX_VCPUS; ++i) {
struct kvm_vcpu *vcpu;
@@ -774,7 +774,7 @@
return 0;
out_unlock:
- spin_unlock(&kvm->lock);
+ mutex_unlock(&kvm->lock);
out_free:
kvm_free_physmem_slot(&new, &old);
out:
@@ -793,14 +793,14 @@
int cleared;
unsigned long any = 0;
- spin_lock(&kvm->lock);
+ mutex_lock(&kvm->lock);
/*
* Prevent changes to guest memory configuration even while the lock
* is not taken.
*/
++kvm->busy;
- spin_unlock(&kvm->lock);
+ mutex_unlock(&kvm->lock);
r = -EINVAL;
if (log->slot >= KVM_MEMORY_SLOTS)
goto out;
@@ -840,9 +840,9 @@
r = 0;
out:
- spin_lock(&kvm->lock);
+ mutex_lock(&kvm->lock);
--kvm->busy;
- spin_unlock(&kvm->lock);
+ mutex_unlock(&kvm->lock);
return r;
}
@@ -872,7 +872,7 @@
< alias->target_phys_addr)
goto out;
- spin_lock(&kvm->lock);
+ mutex_lock(&kvm->lock);
p = &kvm->aliases[alias->slot];
p->base_gfn = alias->guest_phys_addr >> PAGE_SHIFT;
@@ -884,12 +884,12 @@
break;
kvm->naliases = n;
- spin_unlock(&kvm->lock);
+ mutex_unlock(&kvm->lock);
vcpu_load(&kvm->vcpus[0]);
- spin_lock(&kvm->lock);
+ mutex_lock(&kvm->lock);
kvm_mmu_zap_all(&kvm->vcpus[0]);
- spin_unlock(&kvm->lock);
+ mutex_unlock(&kvm->lock);
vcpu_put(&kvm->vcpus[0]);
return 0;
@@ -1408,7 +1408,7 @@
mark_page_dirty(vcpu->kvm, para_state_gpa >> PAGE_SHIFT);
para_state_page = pfn_to_page(para_state_hpa >> PAGE_SHIFT);
- para_state = kmap_atomic(para_state_page, KM_USER0);
+ para_state = kmap(para_state_page);
printk(KERN_DEBUG ".... guest version: %d\n", para_state->guest_version);
printk(KERN_DEBUG ".... size: %d\n", para_state->size);
@@ -1444,7 +1444,7 @@
para_state->ret = 0;
err_kunmap_skip:
- kunmap_atomic(para_state, KM_USER0);
+ kunmap(para_state);
return 0;
err_gp:
return 1;
@@ -1792,12 +1792,12 @@
vcpu->pio.cur_count = now;
for (i = 0; i < nr_pages; ++i) {
- spin_lock(&vcpu->kvm->lock);
+ mutex_lock(&vcpu->kvm->lock);
page = gva_to_page(vcpu, address + i * PAGE_SIZE);
if (page)
get_page(page);
vcpu->pio.guest_pages[i] = page;
- spin_unlock(&vcpu->kvm->lock);
+ mutex_unlock(&vcpu->kvm->lock);
if (!page) {
inject_gp(vcpu);
free_pio_guest_pages(vcpu);
@@ -2170,13 +2170,13 @@
gpa_t gpa;
vcpu_load(vcpu);
- spin_lock(&vcpu->kvm->lock);
+ mutex_lock(&vcpu->kvm->lock);
gpa = vcpu->mmu.gva_to_gpa(vcpu, vaddr);
tr->physical_address = gpa;
tr->valid = gpa != UNMAPPED_GVA;
tr->writeable = 1;
tr->usermode = 0;
- spin_unlock(&vcpu->kvm->lock);
+ mutex_unlock(&vcpu->kvm->lock);
vcpu_put(vcpu);
return 0;
Index: kvm/kernel/mmu.c
===================================================================
--- kvm.orig/kernel/mmu.c 2007-04-16 21:13:52.000000000 +0800
+++ kvm/kernel/mmu.c 2007-05-21 09:12:23.000000000 +0800
@@ -241,11 +241,11 @@
r = __mmu_topup_memory_caches(vcpu, GFP_NOWAIT);
if (r < 0) {
- spin_unlock(&vcpu->kvm->lock);
+ mutex_unlock(&vcpu->kvm->lock);
kvm_arch_ops->vcpu_put(vcpu);
r = __mmu_topup_memory_caches(vcpu, GFP_KERNEL);
kvm_arch_ops->vcpu_load(vcpu);
- spin_lock(&vcpu->kvm->lock);
+ mutex_lock(&vcpu->kvm->lock);
}
return r;
}
Index: kvm/kernel/svm.c
===================================================================
--- kvm.orig/kernel/svm.c 2007-04-16 21:13:52.000000000 +0800
+++ kvm/kernel/svm.c 2007-05-18 09:02:10.000000000 +0800
@@ -612,7 +612,14 @@
{
int cpu;
- cpu = get_cpu();
+ if (cpus_empty(vcpu->saved_mask)) {
+ vcpu->saved_mask = current->cpus_allowed;
+ set_cpus_allowed(current, cpumask_of_cpu(smp_processor_id()));
+ } else {
+ printk("nested vcpu load\n");
+ dump_stack();
+ }
+ cpu = smp_processor_id();
if (unlikely(cpu != vcpu->cpu)) {
u64 tsc_this, delta;
@@ -630,7 +637,8 @@
static void svm_vcpu_put(struct kvm_vcpu *vcpu)
{
rdtscll(vcpu->host_tsc);
- put_cpu();
+ set_cpus_allowed(current, vcpu->saved_mask);
+ cpus_clear(vcpu->saved_mask);
}
static void svm_vcpu_decache(struct kvm_vcpu *vcpu)
@@ -894,21 +902,21 @@
if (is_external_interrupt(exit_int_info))
push_irq(vcpu, exit_int_info & SVM_EVTINJ_VEC_MASK);
- spin_lock(&vcpu->kvm->lock);
+ mutex_lock(&vcpu->kvm->lock);
fault_address = vcpu->svm->vmcb->control.exit_info_2;
error_code = vcpu->svm->vmcb->control.exit_info_1;
r = kvm_mmu_page_fault(vcpu, fault_address, error_code);
if (r < 0) {
- spin_unlock(&vcpu->kvm->lock);
+ mutex_unlock(&vcpu->kvm->lock);
return r;
}
if (!r) {
- spin_unlock(&vcpu->kvm->lock);
+ mutex_unlock(&vcpu->kvm->lock);
return 1;
}
er = emulate_instruction(vcpu, kvm_run, fault_address, error_code);
- spin_unlock(&vcpu->kvm->lock);
+ mutex_unlock(&vcpu->kvm->lock);
switch (er) {
case EMULATE_DONE:
Index: kvm/kernel/vmx.c
===================================================================
--- kvm.orig/kernel/vmx.c 2007-04-16 21:13:52.000000000 +0800
+++ kvm/kernel/vmx.c 2007-05-18 09:02:10.000000000 +0800
@@ -209,7 +209,14 @@
u64 phys_addr = __pa(vcpu->vmcs);
int cpu;
- cpu = get_cpu();
+ if (cpus_empty(vcpu->saved_mask)) {
+ vcpu->saved_mask = current->cpus_allowed;
+ set_cpus_allowed(current, cpumask_of_cpu(smp_processor_id()));
+ } else {
+ printk("nested vcpu load\n");
+ dump_stack();
+ }
+ cpu = smp_processor_id();
if (vcpu->cpu != cpu)
vcpu_clear(vcpu);
@@ -246,7 +253,8 @@
static void vmx_vcpu_put(struct kvm_vcpu *vcpu)
{
- put_cpu();
+ set_cpus_allowed(current, vcpu->saved_mask);
+ cpus_clear(vcpu->saved_mask);
}
static void vmx_vcpu_decache(struct kvm_vcpu *vcpu)
@@ -1329,19 +1337,19 @@
if (is_page_fault(intr_info)) {
cr2 = vmcs_readl(EXIT_QUALIFICATION);
- spin_lock(&vcpu->kvm->lock);
+ mutex_lock(&vcpu->kvm->lock);
r = kvm_mmu_page_fault(vcpu, cr2, error_code);
if (r < 0) {
- spin_unlock(&vcpu->kvm->lock);
+ mutex_unlock(&vcpu->kvm->lock);
return r;
}
if (!r) {
- spin_unlock(&vcpu->kvm->lock);
+ mutex_unlock(&vcpu->kvm->lock);
return 1;
}
er = emulate_instruction(vcpu, kvm_run, cr2, error_code);
- spin_unlock(&vcpu->kvm->lock);
+ mutex_unlock(&vcpu->kvm->lock);
switch (er) {
case EMULATE_DONE:
Index: kvm/kernel/paging_tmpl.h
===================================================================
--- kvm.orig/kernel/paging_tmpl.h 2007-05-21 09:12:23.000000000 +0800
+++ kvm/kernel/paging_tmpl.h 2007-05-21 09:14:54.000000000 +0800
@@ -98,7 +98,7 @@
walker->level - 1, table_gfn);
slot = gfn_to_memslot(vcpu->kvm, table_gfn);
hpa = safe_gpa_to_hpa(vcpu, root & PT64_BASE_ADDR_MASK);
- walker->table = kmap_atomic(pfn_to_page(hpa >> PAGE_SHIFT), KM_USER0);
+ walker->table = kmap(pfn_to_page(hpa >> PAGE_SHIFT));
ASSERT((!is_long_mode(vcpu) && is_pae(vcpu)) ||
(vcpu->cr3 & ~(PAGE_MASK | CR3_FLAGS_MASK)) == 0);
@@ -151,9 +151,8 @@
walker->inherited_ar &= walker->table[index];
table_gfn = (*ptep & PT_BASE_ADDR_MASK) >> PAGE_SHIFT;
paddr = safe_gpa_to_hpa(vcpu, *ptep & PT_BASE_ADDR_MASK);
- kunmap_atomic(walker->table, KM_USER0);
- walker->table = kmap_atomic(pfn_to_page(paddr >> PAGE_SHIFT),
- KM_USER0);
+ kunmap(walker->table);
+ walker->table = kmap(pfn_to_page(paddr >> PAGE_SHIFT));
--walker->level;
walker->table_gfn[walker->level - 1 ] = table_gfn;
pgprintk("%s: table_gfn[%d] %lx\n", __FUNCTION__,
@@ -183,7 +182,7 @@
static void FNAME(release_walker)(struct guest_walker *walker)
{
if (walker->table)
- kunmap_atomic(walker->table, KM_USER0);
+ kunmap(walker->table);
}
static void FNAME(mark_pagetable_dirty)(struct kvm *kvm,
[-- Attachment #4: swap-guest-page.patch --]
[-- Type: text/x-patch, Size: 13811 bytes --]
Against kvm-19.
Permit guest pages to be allocated dynamically and swapped out.
Index: kvm/kernel/mmu.c
===================================================================
--- kvm.orig/kernel/mmu.c 2007-05-21 09:20:11.000000000 +0800
+++ kvm/kernel/mmu.c 2007-05-21 09:20:26.000000000 +0800
@@ -22,6 +22,7 @@
#include <linux/mm.h>
#include <linux/highmem.h>
#include <linux/module.h>
+#include <linux/pagemap.h>
#include "vmx.h"
#include "kvm.h"
@@ -194,6 +195,7 @@
static int is_rmap_pte(u64 pte)
{
+ return 1;
return (pte & (PT_WRITABLE_MASK | PT_PRESENT_MASK))
== (PT_WRITABLE_MASK | PT_PRESENT_MASK);
}
@@ -320,6 +322,8 @@
if (!page_private(page)) {
rmap_printk("rmap_add: %p %llx 0->1\n", spte, *spte);
set_page_private(page,(unsigned long)spte);
+ SetPagePrivate(page);
+ page_cache_get(page);
} else if (!(page_private(page) & 1)) {
rmap_printk("rmap_add: %p %llx 1->many\n", spte, *spte);
desc = mmu_alloc_rmap_desc(vcpu);
@@ -355,9 +359,13 @@
desc->shadow_ptes[j] = NULL;
if (j != 0)
return;
- if (!prev_desc && !desc->more)
+ if (!prev_desc && !desc->more) {
set_page_private(page,(unsigned long)desc->shadow_ptes[0]);
- else
+ if (page_private(page) == 0) {
+ ClearPagePrivate(page);
+ page_cache_release(page);
+ }
+ } else
if (prev_desc)
prev_desc->more = desc->more;
else
@@ -386,6 +394,8 @@
BUG();
}
set_page_private(page,0);
+ ClearPagePrivate(page);
+ page_cache_release(page);
} else {
rmap_printk("rmap_remove: %p %llx many->many\n", spte, *spte);
desc = (struct kvm_rmap_desc *)(page_private(page) & ~1ul);
@@ -405,32 +415,44 @@
}
}
+static void rmap_write_protect_one(struct kvm_vcpu *vcpu, u64 *spte, struct page *page)
+{
+ BUG_ON(!spte);
+ BUG_ON((*spte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT
+ != page_to_pfn(page));
+ BUG_ON(!(*spte & PT_PRESENT_MASK));
+// BUG_ON(!(*spte & PT_WRITABLE_MASK));
+ rmap_printk("rmap_write_protect: spte %p %llx\n", spte, *spte);
+// rmap_remove(vcpu, spte);
+ *spte &= ~(u64)PT_WRITABLE_MASK;
+ kvm_arch_ops->tlb_flush(vcpu);
+}
+
static void rmap_write_protect(struct kvm_vcpu *vcpu, u64 gfn)
{
struct kvm *kvm = vcpu->kvm;
struct page *page;
struct kvm_rmap_desc *desc;
u64 *spte;
+ int i;
page = gfn_to_page(kvm, gfn);
BUG_ON(!page);
- while (page_private(page)) {
- if (!(page_private(page) & 1))
- spte = (u64 *)page_private(page);
- else {
- desc = (struct kvm_rmap_desc *)(page_private(page) & ~1ul);
- spte = desc->shadow_ptes[0];
+ if (!page_private(page))
+ return;
+ if (!(page_private(page) & 1)) {
+ spte = (u64 *)page_private(page);
+ rmap_write_protect_one(vcpu, spte, page);
+ return;
+ }
+ desc = (struct kvm_rmap_desc *)(page_private(page) & ~1ul);
+ while (desc) {
+ for (i = 0; i < RMAP_EXT && desc->shadow_ptes[i]; i++) {
+ spte = desc->shadow_ptes[i];
+ rmap_write_protect_one(vcpu, spte, page);
}
- BUG_ON(!spte);
- BUG_ON((*spte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT
- != page_to_pfn(page));
- BUG_ON(!(*spte & PT_PRESENT_MASK));
- BUG_ON(!(*spte & PT_WRITABLE_MASK));
- rmap_printk("rmap_write_protect: spte %p %llx\n", spte, *spte);
- rmap_remove(vcpu, spte);
- kvm_arch_ops->tlb_flush(vcpu);
- *spte &= ~(u64)PT_WRITABLE_MASK;
+ desc = desc->more;
}
}
@@ -1099,11 +1121,23 @@
}
}
+static void mmu_zap_active_pages(struct kvm_vcpu *vcpu)
+{
+ struct kvm_mmu_page *page;
+
+ while (!list_empty(&vcpu->kvm->active_mmu_pages)) {
+ page = container_of(vcpu->kvm->active_mmu_pages.next,
+ struct kvm_mmu_page, link);
+ kvm_mmu_zap_page(vcpu, page);
+ }
+}
+
int kvm_mmu_reset_context(struct kvm_vcpu *vcpu)
{
int r;
destroy_kvm_mmu(vcpu);
+ mmu_zap_active_pages(vcpu);
r = init_kvm_mmu(vcpu);
if (r < 0)
goto out;
@@ -1231,11 +1265,8 @@
{
struct kvm_mmu_page *page;
- while (!list_empty(&vcpu->kvm->active_mmu_pages)) {
- page = container_of(vcpu->kvm->active_mmu_pages.next,
- struct kvm_mmu_page, link);
- kvm_mmu_zap_page(vcpu, page);
- }
+ mmu_zap_active_pages(vcpu);
+
while (!list_empty(&vcpu->free_pages)) {
page = list_entry(vcpu->free_pages.next,
struct kvm_mmu_page, link);
@@ -1328,7 +1359,7 @@
for (i = 0; i < PT64_ENT_PER_PAGE; ++i)
/* avoid RMW */
if (pt[i] & PT_WRITABLE_MASK) {
- rmap_remove(vcpu, &pt[i]);
+// rmap_remove(vcpu, &pt[i]);
pt[i] &= ~PT_WRITABLE_MASK;
}
}
@@ -1538,3 +1569,30 @@
}
#endif
+
+void rmap_zap_pagetbl(struct kvm_vcpu *vcpu, u64 gfn)
+{
+ struct kvm *kvm = vcpu->kvm;
+ struct kvm_rmap_desc *desc;
+ struct page *page;
+ u64 *spte;
+
+ page = gfn_to_page(kvm, gfn);
+ BUG_ON(!page);
+
+ while (page_private(page)) {
+ if (!(page_private(page) & 1))
+ spte = (u64 *)page_private(page);
+ else {
+ desc = (struct kvm_rmap_desc *)(page_private(page) & ~1ul);
+ spte = desc->shadow_ptes[0];
+ }
+ BUG_ON(!spte);
+ BUG_ON((*spte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT
+ != page_to_pfn(page));
+ BUG_ON(!(*spte & PT_PRESENT_MASK));
+ rmap_remove(vcpu, spte);
+ kvm_arch_ops->tlb_flush(vcpu);
+ *spte = 0;
+ }
+}
Index: kvm/kernel/paging_tmpl.h
===================================================================
--- kvm.orig/kernel/paging_tmpl.h 2007-05-21 09:20:11.000000000 +0800
+++ kvm/kernel/paging_tmpl.h 2007-05-21 09:20:26.000000000 +0800
@@ -369,7 +369,7 @@
*shadow_ent |= PT_WRITABLE_MASK;
FNAME(mark_pagetable_dirty)(vcpu->kvm, walker);
*guest_ent |= PT_DIRTY_MASK;
- rmap_add(vcpu, shadow_ent);
+// rmap_add(vcpu, shadow_ent);
return 1;
}
Index: kvm/kernel/kvm_main.c
===================================================================
--- kvm.orig/kernel/kvm_main.c 2007-05-21 09:20:11.000000000 +0800
+++ kvm/kernel/kvm_main.c 2007-05-21 09:58:39.000000000 +0800
@@ -26,6 +26,7 @@
#include <linux/gfp.h>
#include <asm/msr.h>
#include <linux/mm.h>
+#include <linux/pagemap.h>
#include <linux/miscdevice.h>
#include <linux/vmalloc.h>
#include <asm/uaccess.h>
@@ -322,13 +323,15 @@
{
int i;
- if (!dont || free->phys_mem != dont->phys_mem)
- if (free->phys_mem) {
- for (i = 0; i < free->npages; ++i)
- if (free->phys_mem[i])
- __free_page(free->phys_mem[i]);
- vfree(free->phys_mem);
+ if ((!dont || free->phys_mem != dont->phys_mem) && free->phys_mem) {
+ for (i = 0; i < free->npages; ++i) {
+ if (free->phys_mem[i].entry.val) {
+ printk("free entry %d\n", free->phys_mem[i].entry.val);
+ swap_free(free->phys_mem[i].entry);
+ }
}
+ vfree(free->phys_mem);
+ }
if (!dont || free->dirty_bitmap != dont->dirty_bitmap)
vfree(free->dirty_bitmap);
@@ -388,10 +391,17 @@
static void kvm_destroy_vm(struct kvm *kvm)
{
+ struct inode *inode = kvm_to_address_space(kvm)->host;
+
spin_lock(&kvm_lock);
list_del(&kvm->vm_list);
spin_unlock(&kvm_lock);
kvm_free_vcpus(kvm);
+
+ mutex_lock(&inode->i_mutex);
+ truncate_inode_pages(inode->i_mapping, 0);
+ mutex_unlock(&inode->i_mutex);
+
kvm_free_physmem(kvm);
kfree(kvm);
}
@@ -713,19 +723,12 @@
/* Allocate if a slot is being created */
if (npages && !new.phys_mem) {
- new.phys_mem = vmalloc(npages * sizeof(struct page *));
+ new.phys_mem = vmalloc(npages * sizeof(struct kvm_swap_entry));
if (!new.phys_mem)
goto out_free;
- memset(new.phys_mem, 0, npages * sizeof(struct page *));
- for (i = 0; i < npages; ++i) {
- new.phys_mem[i] = alloc_page(GFP_HIGHUSER
- | __GFP_ZERO);
- if (!new.phys_mem[i])
- goto out_free;
- set_page_private(new.phys_mem[i],0);
- }
+ memset(new.phys_mem, 0, npages * sizeof(struct kvm_swap_entry));
}
/* Allocate page dirty bitmap if needed */
@@ -932,15 +935,105 @@
return __gfn_to_memslot(kvm, gfn);
}
-struct page *gfn_to_page(struct kvm *kvm, gfn_t gfn)
+static struct page *kvm_swapin_page(struct kvm *kvm, gfn_t gfn)
{
struct kvm_memory_slot *slot;
+ struct kvm_swap_entry *entry;
+ struct address_space *mapping = kvm_to_address_space(kvm);
+ struct page *page;
- gfn = unalias_gfn(kvm, gfn);
slot = __gfn_to_memslot(kvm, gfn);
if (!slot)
return NULL;
- return slot->phys_mem[gfn - slot->base_gfn];
+ entry = &slot->phys_mem[gfn - slot->base_gfn];
+ if (entry->entry.val) {
+ /* page is in swap, read page from swap */
+repeat:
+ page = lookup_swap_cache(entry->entry);
+ if (!page) {
+ page = read_swap_cache_async(entry->entry, NULL, 0);
+ if (!page)
+ return NULL;
+ printk("read page from swap %d\n", gfn);
+ wait_on_page_locked(page);
+ if (!PageUptodate(page)) {
+ page_cache_release(page);
+ return NULL;
+ }
+ }
+ while (TestSetPageLocked(page))
+ wait_on_page_locked(page);
+
+ if (PageWriteback(page)) {
+ unlock_page(page);
+ page_cache_release(page);
+ goto repeat;
+ }
+
+ delete_from_swap_cache(page);
+ unlock_page(page);
+ swap_free(entry->entry);
+ entry->entry.val = 0;
+ if (add_to_page_cache(page, mapping, gfn, GFP_ATOMIC))
+ return NULL;
+ } else {
+ /* allocate new page */
+ page = alloc_page(GFP_HIGHUSER | __GFP_ZERO);
+ if (!page)
+ return NULL;
+ if (add_to_page_cache_lru(page, mapping, gfn, GFP_ATOMIC)) {
+ page_cache_release(page);
+ return NULL;
+ }
+ set_page_private(page, 0);
+ }
+ return page;
+}
+
+#define address_space_to_kvm(m) (m->host->i_private)
+static int kvm_move_to_swap(struct page *page)
+{
+ struct address_space *mapping = page->mapping;
+ struct kvm *kvm = address_space_to_kvm(mapping);
+ struct kvm_memory_slot *slot;
+ gfn_t gfn = page->index;
+ swp_entry_t swap;
+
+ swap = get_swap_page();
+ if (!swap.val)
+ goto redirty;
+
+ printk("move page to swap %d\n", page->index);
+ if (move_to_swap_cache(page, swap) == 0) {
+ slot = __gfn_to_memslot(kvm, gfn);
+ slot->phys_mem[gfn - slot->base_gfn].entry = swap;
+ return 0;
+ }
+ swap_free(swap);
+redirty:
+ return AOP_WRITEPAGE_ACTIVATE;
+}
+
+struct page *gfn_to_page(struct kvm *kvm, gfn_t gfn)
+{
+ struct address_space *mapping = kvm_to_address_space(kvm);
+ struct page *page;
+
+ gfn = unalias_gfn(kvm, gfn);
+
+ page = find_get_page(mapping, gfn);
+ if (page) {
+ page_cache_release(page);
+ return page;
+ }
+ page = kvm_swapin_page(kvm, gfn);
+ if (!page)
+ return NULL;
+ set_page_dirty(page);
+ /* page's ref cnt is 2 */
+ unlock_page(page);
+ page_cache_release(page);
+ return page;
}
EXPORT_SYMBOL_GPL(gfn_to_page);
@@ -2711,6 +2804,7 @@
static int kvm_vm_mmap(struct file *file, struct vm_area_struct *vma)
{
+ file_accessed(file);
vma->vm_ops = &kvm_vm_vm_ops;
return 0;
}
@@ -2722,6 +2816,74 @@
.mmap = kvm_vm_mmap,
};
+static int kvm_set_page_dirty(struct page *page)
+{
+ if (!PageDirty(page))
+ SetPageDirty(page);
+ return 0;
+}
+
+extern void rmap_zap_pagetbl(struct kvm_vcpu *vcpu, u64 gfn);
+
+static int kvm_writepage(struct page *page, struct writeback_control *wbc)
+{
+ struct address_space *mapping = page->mapping;
+ struct kvm *kvm = address_space_to_kvm(mapping);
+ int ret = 0;
+
+ printk(KERN_ERR "page write back %d, private %d, count %d\n", page->index, PagePrivate(page), page_count(page));
+
+ mutex_lock(&kvm->lock);
+#if 0
+ /* FIXME: get kvm lock and this must run in the CPU as the kvcpu or the vcpu is not in running mode */
+ /* This will clear PagePrivate */
+ if (PagePrivate(page))
+ rmap_zap_pagetbl(&kvm->vcpus[0], page->index);
+#else
+ /* Maybe just drop this page */
+ if (PagePrivate(page)) {
+ ret = AOP_WRITEPAGE_ACTIVATE;
+ set_page_dirty(page);
+ goto out;
+ }
+#endif
+
+ kvm_move_to_swap(page);
+ unlock_page(page);
+out:
+ mutex_unlock(&kvm->lock);
+
+ return ret;
+}
+
+static int kvm_releasepage(struct page *page, gfp_t gfp)
+{
+ /* writepage removes shadow page table, we should never get here */
+ BUG();
+ return 1;
+}
+
+static void kvm_invalidatepage(struct page *page, unsigned long offset)
+{
+ /*
+ * truncate_page is done after vcpu_free, that means all shadow page
+ * table should be freed already, we should never get here
+ */
+ BUG();
+}
+
+static struct address_space_operations kvm_aops = {
+ .releasepage = kvm_releasepage,
+ .invalidatepage = kvm_invalidatepage,
+ .writepage = kvm_writepage,
+ .set_page_dirty = kvm_set_page_dirty,
+};
+
+static struct backing_dev_info kvm_backing_dev_info __read_mostly = {
+ .ra_pages = 0, /* No readahead */
+ .capabilities = BDI_CAP_NO_ACCT_DIRTY|BDI_CAP_NO_WRITEBACK,
+ .unplug_io_fn = default_unplug_io_fn,
+};
static int kvm_dev_ioctl_create_vm(void)
{
int fd, r;
@@ -2735,11 +2897,15 @@
goto out1;
}
+ inode->i_mapping->a_ops = &kvm_aops;
+ inode->i_mapping->backing_dev_info = &kvm_backing_dev_info;
+
kvm = kvm_create_vm();
if (IS_ERR(kvm)) {
r = PTR_ERR(kvm);
goto out2;
}
+ inode->i_private = kvm;
file = kvmfs_file(inode, kvm);
if (IS_ERR(file)) {
Index: kvm/kernel/kvm.h
===================================================================
--- kvm.orig/kernel/kvm.h 2007-05-21 09:20:11.000000000 +0800
+++ kvm/kernel/kvm.h 2007-05-21 09:20:26.000000000 +0800
@@ -11,6 +11,7 @@
#include <linux/mutex.h>
#include <linux/spinlock.h>
#include <linux/mm.h>
+#include <linux/swap.h>
#include "vmx.h"
#include <linux/kvm.h>
@@ -320,11 +321,16 @@
gfn_t target_gfn;
};
+struct kvm_swap_entry {
+// struct page *page;
+ swp_entry_t entry;
+};
+
struct kvm_memory_slot {
gfn_t base_gfn;
unsigned long npages;
unsigned long flags;
- struct page **phys_mem;
+ struct kvm_swap_entry *phys_mem;
unsigned long *dirty_bitmap;
};
@@ -347,6 +353,7 @@
struct list_head vm_list;
struct file *filp;
};
+#define kvm_to_address_space(kvm) (kvm->filp->f_mapping)
struct kvm_stat {
u32 pf_fixed;