* [RFC]kvm: swapout guest page
@ 2007-05-21 8:12 Shaohua Li
From: Shaohua Li @ 2007-05-21 8:12 UTC (permalink / raw)
To: kvm-devel
[-- Attachment #1: Type: text/plain, Size: 1404 bytes --]
Hi,
I saw some discussions on this topic but no progress, so I did an
experiment to make guest pages be allocated dynamically and swapped out;
please see the attached patches. They are not yet ready for merging, but
I'd like to get some suggestions and help. The patches (against kvm-19)
work here but may not be very stable, as there are likely some locking
issues on the swapout path, which I'll check further later. If you are
brave, please try them :). Some issues I have:
1. There is a spinlock protecting the kvm struct, and we can't sleep
while holding it. A possible solution is a 'release lock, sleep and
retry' scheme, but the shadow page fault path is not easy to convert to
that. The spinlock also prevents the vcpu from being migrated to other
CPUs, since VMX operations must be done on the CPU the vcpu runs on. I
changed it to a mutex plus a CPU-affinity setting. It's a little hacky;
I'd like to see if there are better approaches.
2. Linux page reclaim can't tell whether a guest page is referenced
often. My current patch just blindly adds guest pages to the LRU; this
is not optimized.
3. kvm_ops.tlb_flush should really send an IPI to make the vcpu flush
its TLB, as it might be called on a CPU other than the one the vcpu runs
on. This means the swapout path cannot zap shadow page tables, so my
patch just skips any guest page that a shadow page table points to. I
assume KVM SMP guest support will improve tlb_flush.
Please cc me on any reply, as I am not subscribed to the list.
Thanks,
Shaohua
[-- Attachment #2: export-symbol.patch --]
[-- Type: text/x-patch, Size: 2109 bytes --]
Export the symbols that swapout requires.
Index: 2.6.21-rc7/mm/swap_state.c
===================================================================
--- 2.6.21-rc7.orig/mm/swap_state.c 2007-04-24 02:20:00.000000000 +0800
+++ 2.6.21-rc7/mm/swap_state.c 2007-05-21 10:10:20.000000000 +0800
@@ -207,6 +207,7 @@ void delete_from_swap_cache(struct page
swap_free(entry);
page_cache_release(page);
}
+EXPORT_SYMBOL(delete_from_swap_cache);
/*
* Strange swizzling function only for use by shmem_writepage
@@ -225,6 +226,7 @@ int move_to_swap_cache(struct page *page
INC_CACHE_INFO(exist_race);
return err;
}
+EXPORT_SYMBOL(move_to_swap_cache);
/*
* Strange swizzling function for shmem_getpage (and shmem_unuse)
@@ -307,6 +309,7 @@ struct page * lookup_swap_cache(swp_entr
INC_CACHE_INFO(find_total);
return page;
}
+EXPORT_SYMBOL(lookup_swap_cache);
/*
* Locate a page of swap in physical memory, reserving swap cache space
@@ -364,3 +367,4 @@ struct page *read_swap_cache_async(swp_e
page_cache_release(new_page);
return found_page;
}
+EXPORT_SYMBOL(read_swap_cache_async);
Index: 2.6.21-rc7/mm/swapfile.c
===================================================================
--- 2.6.21-rc7.orig/mm/swapfile.c 2007-04-24 02:20:00.000000000 +0800
+++ 2.6.21-rc7/mm/swapfile.c 2007-05-21 10:10:20.000000000 +0800
@@ -211,6 +211,7 @@ noswap:
spin_unlock(&swap_lock);
return (swp_entry_t) {0};
}
+EXPORT_SYMBOL(get_swap_page);
swp_entry_t get_swap_page_of_type(int type)
{
@@ -303,6 +304,7 @@ void swap_free(swp_entry_t entry)
spin_unlock(&swap_lock);
}
}
+EXPORT_SYMBOL(swap_free);
/*
* How many references to page are currently swapped out?
Index: 2.6.21-rc7/mm/filemap.c
===================================================================
--- 2.6.21-rc7.orig/mm/filemap.c 2007-04-24 02:20:00.000000000 +0800
+++ 2.6.21-rc7/mm/filemap.c 2007-05-21 10:11:09.000000000 +0800
@@ -465,6 +465,7 @@ int add_to_page_cache_lru(struct page *p
lru_cache_add(page);
return ret;
}
+EXPORT_SYMBOL(add_to_page_cache_lru);
#ifdef CONFIG_NUMA
struct page *__page_cache_alloc(gfp_t gfp)
[-- Attachment #3: mutex.patch --]
[-- Type: text/x-patch, Size: 11272 bytes --]
Against kvm-19.
The kvm lock is a spinlock; change it to a mutex so that we can sleep on some paths.
The KVM page fault path is not easy to convert to a 'release lock and retry' scheme.
Index: kvm/kernel/kvm.h
===================================================================
--- kvm.orig/kernel/kvm.h 2007-04-16 21:13:52.000000000 +0800
+++ kvm/kernel/kvm.h 2007-05-21 09:12:23.000000000 +0800
@@ -238,6 +238,7 @@
struct kvm_vcpu {
struct kvm *kvm;
+ cpumask_t saved_mask;
union {
struct vmcs *vmcs;
struct vcpu_svm *svm;
@@ -328,7 +329,7 @@
};
struct kvm {
- spinlock_t lock; /* protects everything except vcpus */
+ struct mutex lock; /* protects everything except vcpus */
int naliases;
struct kvm_mem_alias aliases[KVM_ALIAS_SLOTS];
int nmemslots;
Index: kvm/kernel/kvm_main.c
===================================================================
--- kvm.orig/kernel/kvm_main.c 2007-04-16 21:13:52.000000000 +0800
+++ kvm/kernel/kvm_main.c 2007-05-21 09:16:32.000000000 +0800
@@ -292,7 +292,7 @@
if (!kvm)
return ERR_PTR(-ENOMEM);
- spin_lock_init(&kvm->lock);
+ mutex_init(&kvm->lock);
INIT_LIST_HEAD(&kvm->active_mmu_pages);
for (i = 0; i < KVM_MAX_VCPUS; ++i) {
struct kvm_vcpu *vcpu = &kvm->vcpus[i];
@@ -422,7 +422,7 @@
int ret;
struct page *page;
- spin_lock(&vcpu->kvm->lock);
+ mutex_lock(&vcpu->kvm->lock);
page = gfn_to_page(vcpu->kvm, pdpt_gfn);
/* FIXME: !page - emulate? 0xff? */
pdpt = kmap_atomic(page, KM_USER0);
@@ -441,7 +441,7 @@
out:
kunmap_atomic(pdpt, KM_USER0);
- spin_unlock(&vcpu->kvm->lock);
+ mutex_unlock(&vcpu->kvm->lock);
return ret;
}
@@ -501,9 +501,9 @@
kvm_arch_ops->set_cr0(vcpu, cr0);
vcpu->cr0 = cr0;
- spin_lock(&vcpu->kvm->lock);
+ mutex_lock(&vcpu->kvm->lock);
kvm_mmu_reset_context(vcpu);
- spin_unlock(&vcpu->kvm->lock);
+ mutex_unlock(&vcpu->kvm->lock);
return;
}
EXPORT_SYMBOL_GPL(set_cr0);
@@ -542,9 +542,9 @@
return;
}
kvm_arch_ops->set_cr4(vcpu, cr4);
- spin_lock(&vcpu->kvm->lock);
+ mutex_lock(&vcpu->kvm->lock);
kvm_mmu_reset_context(vcpu);
- spin_unlock(&vcpu->kvm->lock);
+ mutex_unlock(&vcpu->kvm->lock);
}
EXPORT_SYMBOL_GPL(set_cr4);
@@ -572,7 +572,7 @@
}
vcpu->cr3 = cr3;
- spin_lock(&vcpu->kvm->lock);
+ mutex_lock(&vcpu->kvm->lock);
/*
* Does the new cr3 value map to physical memory? (Note, we
* catch an invalid cr3 even in real-mode, because it would
@@ -586,7 +586,7 @@
inject_gp(vcpu);
else
vcpu->mmu.new_cr3(vcpu);
- spin_unlock(&vcpu->kvm->lock);
+ mutex_unlock(&vcpu->kvm->lock);
}
EXPORT_SYMBOL_GPL(set_cr3);
@@ -629,9 +629,9 @@
static void do_remove_write_access(struct kvm_vcpu *vcpu, int slot)
{
- spin_lock(&vcpu->kvm->lock);
+ mutex_lock(&vcpu->kvm->lock);
kvm_mmu_slot_remove_write_access(vcpu, slot);
- spin_unlock(&vcpu->kvm->lock);
+ mutex_unlock(&vcpu->kvm->lock);
}
/*
@@ -670,7 +670,7 @@
mem->flags &= ~KVM_MEM_LOG_DIRTY_PAGES;
raced:
- spin_lock(&kvm->lock);
+ mutex_lock(&kvm->lock);
memory_config_version = kvm->memory_config_version;
new = old = *memslot;
@@ -699,7 +699,7 @@
* Do memory allocations outside lock. memory_config_version will
* detect any races.
*/
- spin_unlock(&kvm->lock);
+ mutex_unlock(&kvm->lock);
/* Deallocate if slot is being removed */
if (!npages)
@@ -738,10 +738,10 @@
memset(new.dirty_bitmap, 0, dirty_bytes);
}
- spin_lock(&kvm->lock);
+ mutex_lock(&kvm->lock);
if (memory_config_version != kvm->memory_config_version) {
- spin_unlock(&kvm->lock);
+ mutex_unlock(&kvm->lock);
kvm_free_physmem_slot(&new, &old);
goto raced;
}
@@ -756,7 +756,7 @@
*memslot = new;
++kvm->memory_config_version;
- spin_unlock(&kvm->lock);
+ mutex_unlock(&kvm->lock);
for (i = 0; i < KVM_MAX_VCPUS; ++i) {
struct kvm_vcpu *vcpu;
@@ -774,7 +774,7 @@
return 0;
out_unlock:
- spin_unlock(&kvm->lock);
+ mutex_unlock(&kvm->lock);
out_free:
kvm_free_physmem_slot(&new, &old);
out:
@@ -793,14 +793,14 @@
int cleared;
unsigned long any = 0;
- spin_lock(&kvm->lock);
+ mutex_lock(&kvm->lock);
/*
* Prevent changes to guest memory configuration even while the lock
* is not taken.
*/
++kvm->busy;
- spin_unlock(&kvm->lock);
+ mutex_unlock(&kvm->lock);
r = -EINVAL;
if (log->slot >= KVM_MEMORY_SLOTS)
goto out;
@@ -840,9 +840,9 @@
r = 0;
out:
- spin_lock(&kvm->lock);
+ mutex_lock(&kvm->lock);
--kvm->busy;
- spin_unlock(&kvm->lock);
+ mutex_unlock(&kvm->lock);
return r;
}
@@ -872,7 +872,7 @@
< alias->target_phys_addr)
goto out;
- spin_lock(&kvm->lock);
+ mutex_lock(&kvm->lock);
p = &kvm->aliases[alias->slot];
p->base_gfn = alias->guest_phys_addr >> PAGE_SHIFT;
@@ -884,12 +884,12 @@
break;
kvm->naliases = n;
- spin_unlock(&kvm->lock);
+ mutex_unlock(&kvm->lock);
vcpu_load(&kvm->vcpus[0]);
- spin_lock(&kvm->lock);
+ mutex_lock(&kvm->lock);
kvm_mmu_zap_all(&kvm->vcpus[0]);
- spin_unlock(&kvm->lock);
+ mutex_unlock(&kvm->lock);
vcpu_put(&kvm->vcpus[0]);
return 0;
@@ -1408,7 +1408,7 @@
mark_page_dirty(vcpu->kvm, para_state_gpa >> PAGE_SHIFT);
para_state_page = pfn_to_page(para_state_hpa >> PAGE_SHIFT);
- para_state = kmap_atomic(para_state_page, KM_USER0);
+ para_state = kmap(para_state_page);
printk(KERN_DEBUG ".... guest version: %d\n", para_state->guest_version);
printk(KERN_DEBUG ".... size: %d\n", para_state->size);
@@ -1444,7 +1444,7 @@
para_state->ret = 0;
err_kunmap_skip:
- kunmap_atomic(para_state, KM_USER0);
+ kunmap(para_state);
return 0;
err_gp:
return 1;
@@ -1792,12 +1792,12 @@
vcpu->pio.cur_count = now;
for (i = 0; i < nr_pages; ++i) {
- spin_lock(&vcpu->kvm->lock);
+ mutex_lock(&vcpu->kvm->lock);
page = gva_to_page(vcpu, address + i * PAGE_SIZE);
if (page)
get_page(page);
vcpu->pio.guest_pages[i] = page;
- spin_unlock(&vcpu->kvm->lock);
+ mutex_unlock(&vcpu->kvm->lock);
if (!page) {
inject_gp(vcpu);
free_pio_guest_pages(vcpu);
@@ -2170,13 +2170,13 @@
gpa_t gpa;
vcpu_load(vcpu);
- spin_lock(&vcpu->kvm->lock);
+ mutex_lock(&vcpu->kvm->lock);
gpa = vcpu->mmu.gva_to_gpa(vcpu, vaddr);
tr->physical_address = gpa;
tr->valid = gpa != UNMAPPED_GVA;
tr->writeable = 1;
tr->usermode = 0;
- spin_unlock(&vcpu->kvm->lock);
+ mutex_unlock(&vcpu->kvm->lock);
vcpu_put(vcpu);
return 0;
Index: kvm/kernel/mmu.c
===================================================================
--- kvm.orig/kernel/mmu.c 2007-04-16 21:13:52.000000000 +0800
+++ kvm/kernel/mmu.c 2007-05-21 09:12:23.000000000 +0800
@@ -241,11 +241,11 @@
r = __mmu_topup_memory_caches(vcpu, GFP_NOWAIT);
if (r < 0) {
- spin_unlock(&vcpu->kvm->lock);
+ mutex_unlock(&vcpu->kvm->lock);
kvm_arch_ops->vcpu_put(vcpu);
r = __mmu_topup_memory_caches(vcpu, GFP_KERNEL);
kvm_arch_ops->vcpu_load(vcpu);
- spin_lock(&vcpu->kvm->lock);
+ mutex_lock(&vcpu->kvm->lock);
}
return r;
}
Index: kvm/kernel/svm.c
===================================================================
--- kvm.orig/kernel/svm.c 2007-04-16 21:13:52.000000000 +0800
+++ kvm/kernel/svm.c 2007-05-18 09:02:10.000000000 +0800
@@ -612,7 +612,14 @@
{
int cpu;
- cpu = get_cpu();
+ if (cpus_empty(vcpu->saved_mask)) {
+ vcpu->saved_mask = current->cpus_allowed;
+ set_cpus_allowed(current, cpumask_of_cpu(smp_processor_id()));
+ } else {
+ printk("nested vcpu load\n");
+ dump_stack();
+ }
+ cpu = smp_processor_id();
if (unlikely(cpu != vcpu->cpu)) {
u64 tsc_this, delta;
@@ -630,7 +637,8 @@
static void svm_vcpu_put(struct kvm_vcpu *vcpu)
{
rdtscll(vcpu->host_tsc);
- put_cpu();
+ set_cpus_allowed(current, vcpu->saved_mask);
+ cpus_clear(vcpu->saved_mask);
}
static void svm_vcpu_decache(struct kvm_vcpu *vcpu)
@@ -894,21 +902,21 @@
if (is_external_interrupt(exit_int_info))
push_irq(vcpu, exit_int_info & SVM_EVTINJ_VEC_MASK);
- spin_lock(&vcpu->kvm->lock);
+ mutex_lock(&vcpu->kvm->lock);
fault_address = vcpu->svm->vmcb->control.exit_info_2;
error_code = vcpu->svm->vmcb->control.exit_info_1;
r = kvm_mmu_page_fault(vcpu, fault_address, error_code);
if (r < 0) {
- spin_unlock(&vcpu->kvm->lock);
+ mutex_unlock(&vcpu->kvm->lock);
return r;
}
if (!r) {
- spin_unlock(&vcpu->kvm->lock);
+ mutex_unlock(&vcpu->kvm->lock);
return 1;
}
er = emulate_instruction(vcpu, kvm_run, fault_address, error_code);
- spin_unlock(&vcpu->kvm->lock);
+ mutex_unlock(&vcpu->kvm->lock);
switch (er) {
case EMULATE_DONE:
Index: kvm/kernel/vmx.c
===================================================================
--- kvm.orig/kernel/vmx.c 2007-04-16 21:13:52.000000000 +0800
+++ kvm/kernel/vmx.c 2007-05-18 09:02:10.000000000 +0800
@@ -209,7 +209,14 @@
u64 phys_addr = __pa(vcpu->vmcs);
int cpu;
- cpu = get_cpu();
+ if (cpus_empty(vcpu->saved_mask)) {
+ vcpu->saved_mask = current->cpus_allowed;
+ set_cpus_allowed(current, cpumask_of_cpu(smp_processor_id()));
+ } else {
+ printk("nested vcpu load\n");
+ dump_stack();
+ }
+ cpu = smp_processor_id();
if (vcpu->cpu != cpu)
vcpu_clear(vcpu);
@@ -246,7 +253,8 @@
static void vmx_vcpu_put(struct kvm_vcpu *vcpu)
{
- put_cpu();
+ set_cpus_allowed(current, vcpu->saved_mask);
+ cpus_clear(vcpu->saved_mask);
}
static void vmx_vcpu_decache(struct kvm_vcpu *vcpu)
@@ -1329,19 +1337,19 @@
if (is_page_fault(intr_info)) {
cr2 = vmcs_readl(EXIT_QUALIFICATION);
- spin_lock(&vcpu->kvm->lock);
+ mutex_lock(&vcpu->kvm->lock);
r = kvm_mmu_page_fault(vcpu, cr2, error_code);
if (r < 0) {
- spin_unlock(&vcpu->kvm->lock);
+ mutex_unlock(&vcpu->kvm->lock);
return r;
}
if (!r) {
- spin_unlock(&vcpu->kvm->lock);
+ mutex_unlock(&vcpu->kvm->lock);
return 1;
}
er = emulate_instruction(vcpu, kvm_run, cr2, error_code);
- spin_unlock(&vcpu->kvm->lock);
+ mutex_unlock(&vcpu->kvm->lock);
switch (er) {
case EMULATE_DONE:
Index: kvm/kernel/paging_tmpl.h
===================================================================
--- kvm.orig/kernel/paging_tmpl.h 2007-05-21 09:12:23.000000000 +0800
+++ kvm/kernel/paging_tmpl.h 2007-05-21 09:14:54.000000000 +0800
@@ -98,7 +98,7 @@
walker->level - 1, table_gfn);
slot = gfn_to_memslot(vcpu->kvm, table_gfn);
hpa = safe_gpa_to_hpa(vcpu, root & PT64_BASE_ADDR_MASK);
- walker->table = kmap_atomic(pfn_to_page(hpa >> PAGE_SHIFT), KM_USER0);
+ walker->table = kmap(pfn_to_page(hpa >> PAGE_SHIFT));
ASSERT((!is_long_mode(vcpu) && is_pae(vcpu)) ||
(vcpu->cr3 & ~(PAGE_MASK | CR3_FLAGS_MASK)) == 0);
@@ -151,9 +151,8 @@
walker->inherited_ar &= walker->table[index];
table_gfn = (*ptep & PT_BASE_ADDR_MASK) >> PAGE_SHIFT;
paddr = safe_gpa_to_hpa(vcpu, *ptep & PT_BASE_ADDR_MASK);
- kunmap_atomic(walker->table, KM_USER0);
- walker->table = kmap_atomic(pfn_to_page(paddr >> PAGE_SHIFT),
- KM_USER0);
+ kunmap(walker->table);
+ walker->table = kmap(pfn_to_page(paddr >> PAGE_SHIFT));
--walker->level;
walker->table_gfn[walker->level - 1 ] = table_gfn;
pgprintk("%s: table_gfn[%d] %lx\n", __FUNCTION__,
@@ -183,7 +182,7 @@
static void FNAME(release_walker)(struct guest_walker *walker)
{
if (walker->table)
- kunmap_atomic(walker->table, KM_USER0);
+ kunmap(walker->table);
}
static void FNAME(mark_pagetable_dirty)(struct kvm *kvm,
[-- Attachment #4: swap-guest-page.patch --]
[-- Type: text/x-patch, Size: 13811 bytes --]
Against kvm-19.
Permit guest pages to be allocated dynamically and swapped out.
Index: kvm/kernel/mmu.c
===================================================================
--- kvm.orig/kernel/mmu.c 2007-05-21 09:20:11.000000000 +0800
+++ kvm/kernel/mmu.c 2007-05-21 09:20:26.000000000 +0800
@@ -22,6 +22,7 @@
#include <linux/mm.h>
#include <linux/highmem.h>
#include <linux/module.h>
+#include <linux/pagemap.h>
#include "vmx.h"
#include "kvm.h"
@@ -194,6 +195,7 @@
static int is_rmap_pte(u64 pte)
{
+ return 1;
return (pte & (PT_WRITABLE_MASK | PT_PRESENT_MASK))
== (PT_WRITABLE_MASK | PT_PRESENT_MASK);
}
@@ -320,6 +322,8 @@
if (!page_private(page)) {
rmap_printk("rmap_add: %p %llx 0->1\n", spte, *spte);
set_page_private(page,(unsigned long)spte);
+ SetPagePrivate(page);
+ page_cache_get(page);
} else if (!(page_private(page) & 1)) {
rmap_printk("rmap_add: %p %llx 1->many\n", spte, *spte);
desc = mmu_alloc_rmap_desc(vcpu);
@@ -355,9 +359,13 @@
desc->shadow_ptes[j] = NULL;
if (j != 0)
return;
- if (!prev_desc && !desc->more)
+ if (!prev_desc && !desc->more) {
set_page_private(page,(unsigned long)desc->shadow_ptes[0]);
- else
+ if (page_private(page) == 0) {
+ ClearPagePrivate(page);
+ page_cache_release(page);
+ }
+ } else
if (prev_desc)
prev_desc->more = desc->more;
else
@@ -386,6 +394,8 @@
BUG();
}
set_page_private(page,0);
+ ClearPagePrivate(page);
+ page_cache_release(page);
} else {
rmap_printk("rmap_remove: %p %llx many->many\n", spte, *spte);
desc = (struct kvm_rmap_desc *)(page_private(page) & ~1ul);
@@ -405,32 +415,44 @@
}
}
+static void rmap_write_protect_one(struct kvm_vcpu *vcpu, u64 *spte, struct page *page)
+{
+ BUG_ON(!spte);
+ BUG_ON((*spte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT
+ != page_to_pfn(page));
+ BUG_ON(!(*spte & PT_PRESENT_MASK));
+// BUG_ON(!(*spte & PT_WRITABLE_MASK));
+ rmap_printk("rmap_write_protect: spte %p %llx\n", spte, *spte);
+// rmap_remove(vcpu, spte);
+ *spte &= ~(u64)PT_WRITABLE_MASK;
+ kvm_arch_ops->tlb_flush(vcpu);
+}
+
static void rmap_write_protect(struct kvm_vcpu *vcpu, u64 gfn)
{
struct kvm *kvm = vcpu->kvm;
struct page *page;
struct kvm_rmap_desc *desc;
u64 *spte;
+ int i;
page = gfn_to_page(kvm, gfn);
BUG_ON(!page);
- while (page_private(page)) {
- if (!(page_private(page) & 1))
- spte = (u64 *)page_private(page);
- else {
- desc = (struct kvm_rmap_desc *)(page_private(page) & ~1ul);
- spte = desc->shadow_ptes[0];
+ if (!page_private(page))
+ return;
+ if (!(page_private(page) & 1)) {
+ spte = (u64 *)page_private(page);
+ rmap_write_protect_one(vcpu, spte, page);
+ return;
+ }
+ desc = (struct kvm_rmap_desc *)(page_private(page) & ~1ul);
+ while (desc) {
+ for (i = 0; i < RMAP_EXT && desc->shadow_ptes[i]; i++) {
+ spte = desc->shadow_ptes[i];
+ rmap_write_protect_one(vcpu, spte, page);
}
- BUG_ON(!spte);
- BUG_ON((*spte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT
- != page_to_pfn(page));
- BUG_ON(!(*spte & PT_PRESENT_MASK));
- BUG_ON(!(*spte & PT_WRITABLE_MASK));
- rmap_printk("rmap_write_protect: spte %p %llx\n", spte, *spte);
- rmap_remove(vcpu, spte);
- kvm_arch_ops->tlb_flush(vcpu);
- *spte &= ~(u64)PT_WRITABLE_MASK;
+ desc = desc->more;
}
}
@@ -1099,11 +1121,23 @@
}
}
+static void mmu_zap_active_pages(struct kvm_vcpu *vcpu)
+{
+ struct kvm_mmu_page *page;
+
+ while (!list_empty(&vcpu->kvm->active_mmu_pages)) {
+ page = container_of(vcpu->kvm->active_mmu_pages.next,
+ struct kvm_mmu_page, link);
+ kvm_mmu_zap_page(vcpu, page);
+ }
+}
+
int kvm_mmu_reset_context(struct kvm_vcpu *vcpu)
{
int r;
destroy_kvm_mmu(vcpu);
+ mmu_zap_active_pages(vcpu);
r = init_kvm_mmu(vcpu);
if (r < 0)
goto out;
@@ -1231,11 +1265,8 @@
{
struct kvm_mmu_page *page;
- while (!list_empty(&vcpu->kvm->active_mmu_pages)) {
- page = container_of(vcpu->kvm->active_mmu_pages.next,
- struct kvm_mmu_page, link);
- kvm_mmu_zap_page(vcpu, page);
- }
+ mmu_zap_active_pages(vcpu);
+
while (!list_empty(&vcpu->free_pages)) {
page = list_entry(vcpu->free_pages.next,
struct kvm_mmu_page, link);
@@ -1328,7 +1359,7 @@
for (i = 0; i < PT64_ENT_PER_PAGE; ++i)
/* avoid RMW */
if (pt[i] & PT_WRITABLE_MASK) {
- rmap_remove(vcpu, &pt[i]);
+// rmap_remove(vcpu, &pt[i]);
pt[i] &= ~PT_WRITABLE_MASK;
}
}
@@ -1538,3 +1569,30 @@
}
#endif
+
+void rmap_zap_pagetbl(struct kvm_vcpu *vcpu, u64 gfn)
+{
+ struct kvm *kvm = vcpu->kvm;
+ struct kvm_rmap_desc *desc;
+ struct page *page;
+ u64 *spte;
+
+ page = gfn_to_page(kvm, gfn);
+ BUG_ON(!page);
+
+ while (page_private(page)) {
+ if (!(page_private(page) & 1))
+ spte = (u64 *)page_private(page);
+ else {
+ desc = (struct kvm_rmap_desc *)(page_private(page) & ~1ul);
+ spte = desc->shadow_ptes[0];
+ }
+ BUG_ON(!spte);
+ BUG_ON((*spte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT
+ != page_to_pfn(page));
+ BUG_ON(!(*spte & PT_PRESENT_MASK));
+ rmap_remove(vcpu, spte);
+ kvm_arch_ops->tlb_flush(vcpu);
+ *spte = 0;
+ }
+}
Index: kvm/kernel/paging_tmpl.h
===================================================================
--- kvm.orig/kernel/paging_tmpl.h 2007-05-21 09:20:11.000000000 +0800
+++ kvm/kernel/paging_tmpl.h 2007-05-21 09:20:26.000000000 +0800
@@ -369,7 +369,7 @@
*shadow_ent |= PT_WRITABLE_MASK;
FNAME(mark_pagetable_dirty)(vcpu->kvm, walker);
*guest_ent |= PT_DIRTY_MASK;
- rmap_add(vcpu, shadow_ent);
+// rmap_add(vcpu, shadow_ent);
return 1;
}
Index: kvm/kernel/kvm_main.c
===================================================================
--- kvm.orig/kernel/kvm_main.c 2007-05-21 09:20:11.000000000 +0800
+++ kvm/kernel/kvm_main.c 2007-05-21 09:58:39.000000000 +0800
@@ -26,6 +26,7 @@
#include <linux/gfp.h>
#include <asm/msr.h>
#include <linux/mm.h>
+#include <linux/pagemap.h>
#include <linux/miscdevice.h>
#include <linux/vmalloc.h>
#include <asm/uaccess.h>
@@ -322,13 +323,15 @@
{
int i;
- if (!dont || free->phys_mem != dont->phys_mem)
- if (free->phys_mem) {
- for (i = 0; i < free->npages; ++i)
- if (free->phys_mem[i])
- __free_page(free->phys_mem[i]);
- vfree(free->phys_mem);
+ if ((!dont || free->phys_mem != dont->phys_mem) && free->phys_mem) {
+ for (i = 0; i < free->npages; ++i) {
+ if (free->phys_mem[i].entry.val) {
+ printk("free entry %d\n", free->phys_mem[i].entry.val);
+ swap_free(free->phys_mem[i].entry);
+ }
}
+ vfree(free->phys_mem);
+ }
if (!dont || free->dirty_bitmap != dont->dirty_bitmap)
vfree(free->dirty_bitmap);
@@ -388,10 +391,17 @@
static void kvm_destroy_vm(struct kvm *kvm)
{
+ struct inode *inode = kvm_to_address_space(kvm)->host;
+
spin_lock(&kvm_lock);
list_del(&kvm->vm_list);
spin_unlock(&kvm_lock);
kvm_free_vcpus(kvm);
+
+ mutex_lock(&inode->i_mutex);
+ truncate_inode_pages(inode->i_mapping, 0);
+ mutex_unlock(&inode->i_mutex);
+
kvm_free_physmem(kvm);
kfree(kvm);
}
@@ -713,19 +723,12 @@
/* Allocate if a slot is being created */
if (npages && !new.phys_mem) {
- new.phys_mem = vmalloc(npages * sizeof(struct page *));
+ new.phys_mem = vmalloc(npages * sizeof(struct kvm_swap_entry));
if (!new.phys_mem)
goto out_free;
- memset(new.phys_mem, 0, npages * sizeof(struct page *));
- for (i = 0; i < npages; ++i) {
- new.phys_mem[i] = alloc_page(GFP_HIGHUSER
- | __GFP_ZERO);
- if (!new.phys_mem[i])
- goto out_free;
- set_page_private(new.phys_mem[i],0);
- }
+ memset(new.phys_mem, 0, npages * sizeof(struct kvm_swap_entry));
}
/* Allocate page dirty bitmap if needed */
@@ -932,15 +935,105 @@
return __gfn_to_memslot(kvm, gfn);
}
-struct page *gfn_to_page(struct kvm *kvm, gfn_t gfn)
+static struct page *kvm_swapin_page(struct kvm *kvm, gfn_t gfn)
{
struct kvm_memory_slot *slot;
+ struct kvm_swap_entry *entry;
+ struct address_space *mapping = kvm_to_address_space(kvm);
+ struct page *page;
- gfn = unalias_gfn(kvm, gfn);
slot = __gfn_to_memslot(kvm, gfn);
if (!slot)
return NULL;
- return slot->phys_mem[gfn - slot->base_gfn];
+ entry = &slot->phys_mem[gfn - slot->base_gfn];
+ if (entry->entry.val) {
+ /* page is in swap, read page from swap */
+repeat:
+ page = lookup_swap_cache(entry->entry);
+ if (!page) {
+ page = read_swap_cache_async(entry->entry, NULL, 0);
+ if (!page)
+ return NULL;
+ printk("read page from swap %d\n", gfn);
+ wait_on_page_locked(page);
+ if (!PageUptodate(page)) {
+ page_cache_release(page);
+ return NULL;
+ }
+ }
+ while (TestSetPageLocked(page))
+ wait_on_page_locked(page);
+
+ if (PageWriteback(page)) {
+ unlock_page(page);
+ page_cache_release(page);
+ goto repeat;
+ }
+
+ delete_from_swap_cache(page);
+ unlock_page(page);
+ swap_free(entry->entry);
+ entry->entry.val = 0;
+ if (add_to_page_cache(page, mapping, gfn, GFP_ATOMIC))
+ return NULL;
+ } else {
+ /* allocate new page */
+ page = alloc_page(GFP_HIGHUSER | __GFP_ZERO);
+ if (!page)
+ return NULL;
+ if (add_to_page_cache_lru(page, mapping, gfn, GFP_ATOMIC)) {
+ page_cache_release(page);
+ return NULL;
+ }
+ set_page_private(page, 0);
+ }
+ return page;
+}
+
+#define address_space_to_kvm(m) (m->host->i_private)
+static int kvm_move_to_swap(struct page *page)
+{
+ struct address_space *mapping = page->mapping;
+ struct kvm *kvm = address_space_to_kvm(mapping);
+ struct kvm_memory_slot *slot;
+ gfn_t gfn = page->index;
+ swp_entry_t swap;
+
+ swap = get_swap_page();
+ if (!swap.val)
+ goto redirty;
+
+ printk("move page to swap %d\n", page->index);
+ if (move_to_swap_cache(page, swap) == 0) {
+ slot = __gfn_to_memslot(kvm, gfn);
+ slot->phys_mem[gfn - slot->base_gfn].entry = swap;
+ return 0;
+ }
+ swap_free(swap);
+redirty:
+ return AOP_WRITEPAGE_ACTIVATE;
+}
+
+struct page *gfn_to_page(struct kvm *kvm, gfn_t gfn)
+{
+ struct address_space *mapping = kvm_to_address_space(kvm);
+ struct page *page;
+
+ gfn = unalias_gfn(kvm, gfn);
+
+ page = find_get_page(mapping, gfn);
+ if (page) {
+ page_cache_release(page);
+ return page;
+ }
+ page = kvm_swapin_page(kvm, gfn);
+ if (!page)
+ return NULL;
+ set_page_dirty(page);
+ /* page's ref cnt is 2 */
+ unlock_page(page);
+ page_cache_release(page);
+ return page;
}
EXPORT_SYMBOL_GPL(gfn_to_page);
@@ -2711,6 +2804,7 @@
static int kvm_vm_mmap(struct file *file, struct vm_area_struct *vma)
{
+ file_accessed(file);
vma->vm_ops = &kvm_vm_vm_ops;
return 0;
}
@@ -2722,6 +2816,74 @@
.mmap = kvm_vm_mmap,
};
+static int kvm_set_page_dirty(struct page *page)
+{
+ if (!PageDirty(page))
+ SetPageDirty(page);
+ return 0;
+}
+
+extern void rmap_zap_pagetbl(struct kvm_vcpu *vcpu, u64 gfn);
+
+static int kvm_writepage(struct page *page, struct writeback_control *wbc)
+{
+ struct address_space *mapping = page->mapping;
+ struct kvm *kvm = address_space_to_kvm(mapping);
+ int ret = 0;
+
+ printk(KERN_ERR "page write back %d, private %d, count %d\n", page->index, PagePrivate(page), page_count(page));
+
+ mutex_lock(&kvm->lock);
+#if 0
+ /* FIXME: get kvm lock and this must run in the CPU as the kvcpu or the vcpu is not in running mode */
+ /* This will clear PagePrivate */
+ if (PagePrivate(page))
+ rmap_zap_pagetbl(&kvm->vcpus[0], page->index);
+#else
+ /* Maybe just drop this page */
+ if (PagePrivate(page)) {
+ ret = AOP_WRITEPAGE_ACTIVATE;
+ set_page_dirty(page);
+ goto out;
+ }
+#endif
+
+ kvm_move_to_swap(page);
+ unlock_page(page);
+out:
+ mutex_unlock(&kvm->lock);
+
+ return ret;
+}
+
+static int kvm_releasepage(struct page *page, gfp_t gfp)
+{
+ /* writepage removes shadow page table, we should never get here */
+ BUG();
+ return 1;
+}
+
+static void kvm_invalidatepage(struct page *page, unsigned long offset)
+{
+ /*
+ * truncate_page is done after vcpu_free, that means all shadow page
+ * table should be freed already, we should never get here
+ */
+ BUG();
+}
+
+static struct address_space_operations kvm_aops = {
+ .releasepage = kvm_releasepage,
+ .invalidatepage = kvm_invalidatepage,
+ .writepage = kvm_writepage,
+ .set_page_dirty = kvm_set_page_dirty,
+};
+
+static struct backing_dev_info kvm_backing_dev_info __read_mostly = {
+ .ra_pages = 0, /* No readahead */
+ .capabilities = BDI_CAP_NO_ACCT_DIRTY|BDI_CAP_NO_WRITEBACK,
+ .unplug_io_fn = default_unplug_io_fn,
+};
static int kvm_dev_ioctl_create_vm(void)
{
int fd, r;
@@ -2735,11 +2897,15 @@
goto out1;
}
+ inode->i_mapping->a_ops = &kvm_aops;
+ inode->i_mapping->backing_dev_info = &kvm_backing_dev_info;
+
kvm = kvm_create_vm();
if (IS_ERR(kvm)) {
r = PTR_ERR(kvm);
goto out2;
}
+ inode->i_private = kvm;
file = kvmfs_file(inode, kvm);
if (IS_ERR(file)) {
Index: kvm/kernel/kvm.h
===================================================================
--- kvm.orig/kernel/kvm.h 2007-05-21 09:20:11.000000000 +0800
+++ kvm/kernel/kvm.h 2007-05-21 09:20:26.000000000 +0800
@@ -11,6 +11,7 @@
#include <linux/mutex.h>
#include <linux/spinlock.h>
#include <linux/mm.h>
+#include <linux/swap.h>
#include "vmx.h"
#include <linux/kvm.h>
@@ -320,11 +321,16 @@
gfn_t target_gfn;
};
+struct kvm_swap_entry {
+// struct page *page;
+ swp_entry_t entry;
+};
+
struct kvm_memory_slot {
gfn_t base_gfn;
unsigned long npages;
unsigned long flags;
- struct page **phys_mem;
+ struct kvm_swap_entry *phys_mem;
unsigned long *dirty_bitmap;
};
@@ -347,6 +353,7 @@
struct list_head vm_list;
struct file *filp;
};
+#define kvm_to_address_space(kvm) (kvm->filp->f_mapping)
struct kvm_stat {
u32 pf_fixed;