Re: [announce] [patch] KVM paravirtualization for Linux

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Avi Kivity <avi@qumranet.com>
To: Ingo Molnar <mingo@elte.hu>
Cc: kvm-devel <kvm-devel@lists.sourceforge.net>,
	linux-kernel@vger.kernel.org
Subject: Re: [announce] [patch] KVM paravirtualization for Linux
Date: Sun, 07 Jan 2007 14:20:22 +0200	[thread overview]
Message-ID: <45A0E586.50806@qumranet.com> (raw)
In-Reply-To: <20070105215223.GA5361@elte.hu>

Ingo Molnar wrote:
> i'm pleased to announce the first release of paravirtualized KVM (Linux 
> under Linux), which includes support for the hardware cr3-cache feature 
> of Intel-VMX CPUs. (which speeds up context switches and TLB flushes)
>
> the patch is against 2.6.20-rc3 + KVM trunk and can be found at:
>
>    http://redhat.com/~mingo/kvm-paravirt-patches/
>
> Some aspects of the code are still a bit ad-hoc and incomplete, but the 
> code is stable enough in my testing and i'd like to have some feedback. 
>
> Firstly, here are some numbers:
>
> 2-task context-switch performance (in microseconds, lower is better):
>
>  native:                       1.11
>  ----------------------------------
>  Qemu:                        61.18
>  KVM upstream:                53.01
>  KVM trunk:                    6.36
>  KVM trunk+paravirt/cr3:       1.60
>
> i.e. 2-task context-switch performance is faster by a factor of 4, and 
> is now quite close to native speed!
>   

Very impressive!  The gain probably comes not only from avoiding the 
vmentry/vmexit, but also from avoiding the flushing of the global page 
tlb entries.

> "hackbench 1" (utilizes 40 tasks, numbers in seconds, lower is better):
>
>  native:                       0.25
>  ----------------------------------
>  Qemu:                         7.8
>  KVM upstream:                 2.8
>  KVM trunk:                    0.55
>  KVM paravirt/cr3:             0.36
>
> almost twice as fast.
>
> "hackbench 5" (utilizes 200 tasks, numbers in seconds, lower is better):
>
>  native:                       0.9
>  ----------------------------------
>  Qemu:                        35.2
>  KVM upstream:                 9.4
>  KVM trunk:                    2.8
>  KVM paravirt/cr3:             2.2
>
> still a 30% improvement - which isnt too bad considering that 200 tasks 
> are context-switching in this workload and the cr3 cache in current CPUs 
> is only 4 entries.
>   

This is a little too good to be true.  Were both runs with the same 
KVM_NUM_MMU_PAGES?

I'm also concerned that at this point in time the cr3 optimizations will 
only show an improvement in microbenchmarks.  In real life workloads a 
context switch is usually preceded by an I/O, and with the current sorry 
state of kvm I/O the context switch time would be dominated by the I/O time.

> the patchset does the following:
>
> - it provides an ad-hoc paravirtualization hypercall API between a Linux 
>   guest and a Linux host. (this will be replaced with a proper
>   hypercall later on.)
>
> - using the hypercall API it utilizes the "cr3 target cache" feature in 
>   Intel VMX CPUs, and extends KVM to make use of that cache. This 
>   feature allows the avoidance of expensive VM exits into hypervisor 
>   context. (The guest needs to be 'aware' and the cache has to be
>   shared between the guest and the hypervisor. So fully emulated OSs
>   wont benefit from this feature.)
>
> - a few simpler paravirtualization changes are done for Linux guests: IO 
>   port delays do not cause a VM exit anymore, the i8259A IRQ controller 
>   code got simplified (this will be replaced with a proper, hypercall
>   based and host-maintained IRQ controller implementation) and TLB 
>   flushes are more efficient, because no cr3 reads happen which would 
>   otherwise cause a VM exit. These changes have a visible effect
>   already: they reduce qemu's CPU usage when a guest idles in HLT, by 
>   about 25%. (from ~20% CPU usage to 14% CPU usage if an -rt guest has 
>   HZ=1000)
>
> Paravirtualization is triggered via the kvm_paravirt=1 boot option (for 
> now, this too is ad-hoc) - if that is passed then the KVM guest will 
> probe for paravirtualization availability on the hypervisor side - and 
> will use it if found. (If the guest does not find KVM-paravirt support 
> on the hypervisor side then it will continue as a fully emulated guest.)
>
> Issues: i only tested this on 32-bit VMX. (64-bit should work with not 
> too many changes, the paravirt.c bits can be carried over to 64-bit 
> almost as-is. But i didnt want to spread the code too wide.)
>
> Comments, suggestions are welcome!
>
> 	Ingo
>   

Some comments below on the code.

> +
> +/*
> + * Special, register-to-cr3 instruction based hypercall API
> + * variant to the KVM host. This utilizes the cr3 filter capability
> + * of the hardware - if this works out then no VM exit happens,
> + * if a VM exit happens then KVM will get the virtual address too.
> + */
> +static void kvm_write_cr3(unsigned long guest_cr3)
> +{
> +    struct kvm_vcpu_para_state *para_state = &get_cpu_var(para_state);
> +    struct kvm_cr3_cache *cache = &para_state->cr3_cache;
> +    int idx;
> +
> +    /*
> +     * Check the cache (maintained by the host) for a matching
> +     * guest_cr3 => host_cr3 mapping. Use it if found:
> +     */
> +    for (idx = 0; idx < cache->max_idx; idx++) {
> +        if (cache->entry[idx].guest_cr3 == guest_cr3) {
> +            /*
> +             * Cache-hit: we load the cached host-CR3 value.
> +             * This never causes any VM exit. (if it does then the
> +             * hypervisor could do nothing with this instruction
> +             * and the guest OS would be aborted)
> +             */
> +            asm volatile("movl %0, %%cr3"
> +                : : "r" (cache->entry[idx].host_cr3));
> +            goto out;
> +        }
> +    }
> +
> +    /*
> +     * Cache-miss. Load the guest-cr3 value into cr3, which will
> +     * cause a VM exit to the hypervisor, which then loads the
> +     * host cr3 value and updates the cr3_cache.
> +     */
> +    asm volatile("movl %0, %%cr3" : : "r" (guest_cr3));
> +out:
> +    put_cpu_var(para_state);
> +}

Well, you did say it was ad-hoc.  For reference, this is how I see the 
hypercall API:

- A virtual pci device exports a page through the pci rom interface.  
The page contains the hypercall code approriate for the current cpu.  
This allows migration to work across different cpu vendors.
- In case the pci rom is discovered too late in the boot process, the 
address (gpa) can also be exported via a kvm-specific msr.
- Guest/host communications is by guest physical addressed, as the 
virtual->physical translation is much cheaper on the guest (__pa() vs a 
page table walk).


>
> +
> +/*
> + * Simplified i8259A controller handling:
> + */
> +static void mask_and_ack_kvm(unsigned int irq)
> +{
> +    unsigned int irqmask = 1 << irq;
> +    unsigned long flags;
> +
> +    spin_lock_irqsave(&i8259A_lock, flags);
> +    cached_irq_mask |= irqmask;
> +
> +    if (irq & 8) {
> +        outb(cached_slave_mask, PIC_SLAVE_IMR);
> +        outb(0x60+(irq&7),PIC_SLAVE_CMD);/* 'Specific EOI' to slave */
> +        outb(0x60+PIC_CASCADE_IR,PIC_MASTER_CMD); /* 'Specific EOI' 
> to master-IRQ2 */
> +    } else {
> +        outb(cached_master_mask, PIC_MASTER_IMR);
> +        /* 'Specific EOI' to master: */
> +        outb(0x60+irq, PIC_MASTER_CMD);
> +    }
> +    spin_unlock_irqrestore(&i8259A_lock, flags);
> +}

Any reason this can't be applied to mainline?  There's probably no 
downside to native, and it would benefit all virtualization solutions 
equally.

> Index: linux/drivers/kvm/kvm.h
> ===================================================================
> --- linux.orig/drivers/kvm/kvm.h
> +++ linux/drivers/kvm/kvm.h
> @@ -165,7 +165,7 @@ struct kvm_mmu {
>      int root_level;
>      int shadow_root_level;
>  
> -    u64 *pae_root;
> +    u64 *pae_root[KVM_CR3_CACHE_SIZE];

hmm.  wouldn't it be simpler to have pae_root always point at the 
current root?

>
> Index: linux/drivers/kvm/mmu.c
> ===================================================================
> --- linux.orig/drivers/kvm/mmu.c
> +++ linux/drivers/kvm/mmu.c
> @@ -779,7 +779,7 @@ static int nonpaging_map(struct kvm_vcpu
>  
>  static void mmu_free_roots(struct kvm_vcpu *vcpu)
>  {
> -    int i;
> +    int i, j;
>      struct kvm_mmu_page *page;
>  
>  #ifdef CONFIG_X86_64
> @@ -793,21 +793,40 @@ static void mmu_free_roots(struct kvm_vc
>          return;
>      }
>  #endif
> -    for (i = 0; i < 4; ++i) {
> -        hpa_t root = vcpu->mmu.pae_root[i];
> +    /*
> +     * Skip to the next cr3 filter entry and free it (if it's occupied):
> +     */
> +    vcpu->cr3_cache_idx++;
> +    if (unlikely(vcpu->cr3_cache_idx >= vcpu->cr3_cache_limit))
> +        vcpu->cr3_cache_idx = 0;
>  
> -        ASSERT(VALID_PAGE(root));
> -        root &= PT64_BASE_ADDR_MASK;
> -        page = page_header(root);
> -        --page->root_count;
> -        vcpu->mmu.pae_root[i] = INVALID_PAGE;
> +    j = vcpu->cr3_cache_idx;
> +    /*
> +     * Clear the guest-visible entry:
> +     */
> +    if (vcpu->para_state) {
> +        vcpu->para_state->cr3_cache.entry[j].guest_cr3 = 0;
> +        vcpu->para_state->cr3_cache.entry[j].host_cr3 = 0;
> +    }
> +    ASSERT(vcpu->mmu.pae_root[j]);
> +    if (VALID_PAGE(vcpu->mmu.pae_root[j][0])) {
> +        vcpu->guest_cr3_gpa[j] = INVALID_PAGE;
> +        for (i = 0; i < 4; ++i) {
> +            hpa_t root = vcpu->mmu.pae_root[j][i];
> +
> +            ASSERT(VALID_PAGE(root));
> +            root &= PT64_BASE_ADDR_MASK;
> +            page = page_header(root);
> +            --page->root_count;
> +            vcpu->mmu.pae_root[j][i] = INVALID_PAGE;
> +        }
>      }
>      vcpu->mmu.root_hpa = INVALID_PAGE;
>  }

You keep the page directories pinned here.  This can be a problem if a 
guest frees a page directory, and then starts using it as a regular 
page.  kvm sometimes chooses not to emulate a write to a guest page 
table, but instead to zap it, which is impossible when the page is 
freed.  You need to either unpin the page when that happens, or add a 
hypercall to let kvm know when a page directory is freed.

There is also a minor problem that changes to the pgd aren't caught by 
kvm.  It doesn't hurt much as this is PV and we can relax the guest/host 
contract a little.


>
>  static int alloc_mmu_pages(struct kvm_vcpu *vcpu)
>  {
>      struct page *page;
> -    int i;
> +    int i, j;
>  
>      ASSERT(vcpu);
>  
> @@ -1227,17 +1261,22 @@ static int alloc_mmu_pages(struct kvm_vc
>          ++vcpu->kvm->n_free_mmu_pages;
>      }
>  
> -    /*
> -     * When emulating 32-bit mode, cr3 is only 32 bits even on x86_64.
> -     * Therefore we need to allocate shadow page tables in the first
> -     * 4GB of memory, which happens to fit the DMA32 zone.
> -     */
> -    page = alloc_page(GFP_KERNEL | __GFP_DMA32);
> -    if (!page)
> -        goto error_1;
> -    vcpu->mmu.pae_root = page_address(page);
> -    for (i = 0; i < 4; ++i)
> -        vcpu->mmu.pae_root[i] = INVALID_PAGE;
> +    for (j = 0; j < KVM_CR3_CACHE_SIZE; j++) {
> +        /*
> +         * When emulating 32-bit mode, cr3 is only 32 bits even on
> +         * x86_64. Therefore we need to allocate shadow page tables
> +         * in the first 4GB of memory, which happens to fit the DMA32
> +         * zone:
> +         */
> +        page = alloc_page(GFP_KERNEL | __GFP_DMA32);
> +        if (!page)
> +            goto error_1;
> +
> +        ASSERT(!vcpu->mmu.pae_root[j]);
> +        vcpu->mmu.pae_root[j] = page_address(page);
> +        for (i = 0; i < 4; ++i)
> +            vcpu->mmu.pae_root[j][i] = INVALID_PAGE;
> +    }

Since a pae root uses just 32 bytes, you can store all cache entries in 
a single page.  Not that it matters much.

>  #define ENABLE_INTERRUPTS_SYSEXIT            \
> Index: linux/include/linux/kvm.h
> ===================================================================
> --- linux.orig/include/linux/kvm.h
> +++ linux/include/linux/kvm.h
> @@ -238,4 +238,44 @@ struct kvm_dirty_log {
>  
>  #define KVM_DUMP_VCPU             _IOW(KVMIO, 250, int /* vcpu_slot */)
>  
> +
> +#define KVM_CR3_CACHE_SIZE    4
> +
> +struct kvm_cr3_cache_entry {
> +    u64 guest_cr3;
> +    u64 host_cr3;
> +};
> +
> +struct kvm_cr3_cache {
> +    struct kvm_cr3_cache_entry entry[KVM_CR3_CACHE_SIZE];
> +    u32 max_idx;
> +};
> +
> +/*
> + * Per-VCPU descriptor area shared between guest and host. Writable to
> + * both guest and host. Registered with the host by the guest when
> + * a guest acknowledges paravirtual mode.
> + */
> +struct kvm_vcpu_para_state {
> +    /*
> +     * API version information for compatibility. If there's any support
> +     * mismatch (too old host trying to execute too new guest) then
> +     * the host will deny entry into paravirtual mode. Any other
> +     * combination (new host + old guest and new host + new guest)
> +     * is supposed to work - new host versions will support all old
> +     * guest API versions.
> +     */
> +    u32 guest_version;
> +    u32 host_version;
> +    u32 size;
> +    u32 __pad_00;
> +
> +    struct kvm_cr3_cache cr3_cache;
> +
> +} __attribute__ ((aligned(PAGE_SIZE)));
> +
> +#define KVM_PARA_API_VERSION 1
> +
> +#define KVM_API_MAGIC 0x87654321
> +

<linux/kvm.h> is the vmm userspace interface.  The guest/host interface 
should probably go somewhere else.

-- 
error compiling committee.c: too many arguments to function

WARNING: multiple messages have this Message-ID (diff)

From: Avi Kivity <avi-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
To: Ingo Molnar <mingo-X9Un+BFzKDI@public.gmane.org>
Cc: kvm-devel
	<kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org>,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Subject: Re: [announce] [patch] KVM paravirtualization for Linux
Date: Sun, 07 Jan 2007 14:20:22 +0200	[thread overview]
Message-ID: <45A0E586.50806@qumranet.com> (raw)
In-Reply-To: <20070105215223.GA5361-X9Un+BFzKDI@public.gmane.org>

Ingo Molnar wrote:
> i'm pleased to announce the first release of paravirtualized KVM (Linux 
> under Linux), which includes support for the hardware cr3-cache feature 
> of Intel-VMX CPUs. (which speeds up context switches and TLB flushes)
>
> the patch is against 2.6.20-rc3 + KVM trunk and can be found at:
>
>    http://redhat.com/~mingo/kvm-paravirt-patches/
>
> Some aspects of the code are still a bit ad-hoc and incomplete, but the 
> code is stable enough in my testing and i'd like to have some feedback. 
>
> Firstly, here are some numbers:
>
> 2-task context-switch performance (in microseconds, lower is better):
>
>  native:                       1.11
>  ----------------------------------
>  Qemu:                        61.18
>  KVM upstream:                53.01
>  KVM trunk:                    6.36
>  KVM trunk+paravirt/cr3:       1.60
>
> i.e. 2-task context-switch performance is faster by a factor of 4, and 
> is now quite close to native speed!
>   

Very impressive!  The gain probably comes not only from avoiding the 
vmentry/vmexit, but also from avoiding the flushing of the global page 
tlb entries.

> "hackbench 1" (utilizes 40 tasks, numbers in seconds, lower is better):
>
>  native:                       0.25
>  ----------------------------------
>  Qemu:                         7.8
>  KVM upstream:                 2.8
>  KVM trunk:                    0.55
>  KVM paravirt/cr3:             0.36
>
> almost twice as fast.
>
> "hackbench 5" (utilizes 200 tasks, numbers in seconds, lower is better):
>
>  native:                       0.9
>  ----------------------------------
>  Qemu:                        35.2
>  KVM upstream:                 9.4
>  KVM trunk:                    2.8
>  KVM paravirt/cr3:             2.2
>
> still a 30% improvement - which isnt too bad considering that 200 tasks 
> are context-switching in this workload and the cr3 cache in current CPUs 
> is only 4 entries.
>   

This is a little too good to be true.  Were both runs with the same 
KVM_NUM_MMU_PAGES?

I'm also concerned that at this point in time the cr3 optimizations will 
only show an improvement in microbenchmarks.  In real life workloads a 
context switch is usually preceded by an I/O, and with the current sorry 
state of kvm I/O the context switch time would be dominated by the I/O time.

> the patchset does the following:
>
> - it provides an ad-hoc paravirtualization hypercall API between a Linux 
>   guest and a Linux host. (this will be replaced with a proper
>   hypercall later on.)
>
> - using the hypercall API it utilizes the "cr3 target cache" feature in 
>   Intel VMX CPUs, and extends KVM to make use of that cache. This 
>   feature allows the avoidance of expensive VM exits into hypervisor 
>   context. (The guest needs to be 'aware' and the cache has to be
>   shared between the guest and the hypervisor. So fully emulated OSs
>   wont benefit from this feature.)
>
> - a few simpler paravirtualization changes are done for Linux guests: IO 
>   port delays do not cause a VM exit anymore, the i8259A IRQ controller 
>   code got simplified (this will be replaced with a proper, hypercall
>   based and host-maintained IRQ controller implementation) and TLB 
>   flushes are more efficient, because no cr3 reads happen which would 
>   otherwise cause a VM exit. These changes have a visible effect
>   already: they reduce qemu's CPU usage when a guest idles in HLT, by 
>   about 25%. (from ~20% CPU usage to 14% CPU usage if an -rt guest has 
>   HZ=1000)
>
> Paravirtualization is triggered via the kvm_paravirt=1 boot option (for 
> now, this too is ad-hoc) - if that is passed then the KVM guest will 
> probe for paravirtualization availability on the hypervisor side - and 
> will use it if found. (If the guest does not find KVM-paravirt support 
> on the hypervisor side then it will continue as a fully emulated guest.)
>
> Issues: i only tested this on 32-bit VMX. (64-bit should work with not 
> too many changes, the paravirt.c bits can be carried over to 64-bit 
> almost as-is. But i didnt want to spread the code too wide.)
>
> Comments, suggestions are welcome!
>
> 	Ingo
>   

Some comments below on the code.

> +
> +/*
> + * Special, register-to-cr3 instruction based hypercall API
> + * variant to the KVM host. This utilizes the cr3 filter capability
> + * of the hardware - if this works out then no VM exit happens,
> + * if a VM exit happens then KVM will get the virtual address too.
> + */
> +static void kvm_write_cr3(unsigned long guest_cr3)
> +{
> +    struct kvm_vcpu_para_state *para_state = &get_cpu_var(para_state);
> +    struct kvm_cr3_cache *cache = &para_state->cr3_cache;
> +    int idx;
> +
> +    /*
> +     * Check the cache (maintained by the host) for a matching
> +     * guest_cr3 => host_cr3 mapping. Use it if found:
> +     */
> +    for (idx = 0; idx < cache->max_idx; idx++) {
> +        if (cache->entry[idx].guest_cr3 == guest_cr3) {
> +            /*
> +             * Cache-hit: we load the cached host-CR3 value.
> +             * This never causes any VM exit. (if it does then the
> +             * hypervisor could do nothing with this instruction
> +             * and the guest OS would be aborted)
> +             */
> +            asm volatile("movl %0, %%cr3"
> +                : : "r" (cache->entry[idx].host_cr3));
> +            goto out;
> +        }
> +    }
> +
> +    /*
> +     * Cache-miss. Load the guest-cr3 value into cr3, which will
> +     * cause a VM exit to the hypervisor, which then loads the
> +     * host cr3 value and updates the cr3_cache.
> +     */
> +    asm volatile("movl %0, %%cr3" : : "r" (guest_cr3));
> +out:
> +    put_cpu_var(para_state);
> +}

Well, you did say it was ad-hoc.  For reference, this is how I see the 
hypercall API:

- A virtual pci device exports a page through the pci rom interface.  
The page contains the hypercall code approriate for the current cpu.  
This allows migration to work across different cpu vendors.
- In case the pci rom is discovered too late in the boot process, the 
address (gpa) can also be exported via a kvm-specific msr.
- Guest/host communications is by guest physical addressed, as the 
virtual->physical translation is much cheaper on the guest (__pa() vs a 
page table walk).


>
> +
> +/*
> + * Simplified i8259A controller handling:
> + */
> +static void mask_and_ack_kvm(unsigned int irq)
> +{
> +    unsigned int irqmask = 1 << irq;
> +    unsigned long flags;
> +
> +    spin_lock_irqsave(&i8259A_lock, flags);
> +    cached_irq_mask |= irqmask;
> +
> +    if (irq & 8) {
> +        outb(cached_slave_mask, PIC_SLAVE_IMR);
> +        outb(0x60+(irq&7),PIC_SLAVE_CMD);/* 'Specific EOI' to slave */
> +        outb(0x60+PIC_CASCADE_IR,PIC_MASTER_CMD); /* 'Specific EOI' 
> to master-IRQ2 */
> +    } else {
> +        outb(cached_master_mask, PIC_MASTER_IMR);
> +        /* 'Specific EOI' to master: */
> +        outb(0x60+irq, PIC_MASTER_CMD);
> +    }
> +    spin_unlock_irqrestore(&i8259A_lock, flags);
> +}

Any reason this can't be applied to mainline?  There's probably no 
downside to native, and it would benefit all virtualization solutions 
equally.

> Index: linux/drivers/kvm/kvm.h
> ===================================================================
> --- linux.orig/drivers/kvm/kvm.h
> +++ linux/drivers/kvm/kvm.h
> @@ -165,7 +165,7 @@ struct kvm_mmu {
>      int root_level;
>      int shadow_root_level;
>  
> -    u64 *pae_root;
> +    u64 *pae_root[KVM_CR3_CACHE_SIZE];

hmm.  wouldn't it be simpler to have pae_root always point at the 
current root?

>
> Index: linux/drivers/kvm/mmu.c
> ===================================================================
> --- linux.orig/drivers/kvm/mmu.c
> +++ linux/drivers/kvm/mmu.c
> @@ -779,7 +779,7 @@ static int nonpaging_map(struct kvm_vcpu
>  
>  static void mmu_free_roots(struct kvm_vcpu *vcpu)
>  {
> -    int i;
> +    int i, j;
>      struct kvm_mmu_page *page;
>  
>  #ifdef CONFIG_X86_64
> @@ -793,21 +793,40 @@ static void mmu_free_roots(struct kvm_vc
>          return;
>      }
>  #endif
> -    for (i = 0; i < 4; ++i) {
> -        hpa_t root = vcpu->mmu.pae_root[i];
> +    /*
> +     * Skip to the next cr3 filter entry and free it (if it's occupied):
> +     */
> +    vcpu->cr3_cache_idx++;
> +    if (unlikely(vcpu->cr3_cache_idx >= vcpu->cr3_cache_limit))
> +        vcpu->cr3_cache_idx = 0;
>  
> -        ASSERT(VALID_PAGE(root));
> -        root &= PT64_BASE_ADDR_MASK;
> -        page = page_header(root);
> -        --page->root_count;
> -        vcpu->mmu.pae_root[i] = INVALID_PAGE;
> +    j = vcpu->cr3_cache_idx;
> +    /*
> +     * Clear the guest-visible entry:
> +     */
> +    if (vcpu->para_state) {
> +        vcpu->para_state->cr3_cache.entry[j].guest_cr3 = 0;
> +        vcpu->para_state->cr3_cache.entry[j].host_cr3 = 0;
> +    }
> +    ASSERT(vcpu->mmu.pae_root[j]);
> +    if (VALID_PAGE(vcpu->mmu.pae_root[j][0])) {
> +        vcpu->guest_cr3_gpa[j] = INVALID_PAGE;
> +        for (i = 0; i < 4; ++i) {
> +            hpa_t root = vcpu->mmu.pae_root[j][i];
> +
> +            ASSERT(VALID_PAGE(root));
> +            root &= PT64_BASE_ADDR_MASK;
> +            page = page_header(root);
> +            --page->root_count;
> +            vcpu->mmu.pae_root[j][i] = INVALID_PAGE;
> +        }
>      }
>      vcpu->mmu.root_hpa = INVALID_PAGE;
>  }

You keep the page directories pinned here.  This can be a problem if a 
guest frees a page directory, and then starts using it as a regular 
page.  kvm sometimes chooses not to emulate a write to a guest page 
table, but instead to zap it, which is impossible when the page is 
freed.  You need to either unpin the page when that happens, or add a 
hypercall to let kvm know when a page directory is freed.

There is also a minor problem that changes to the pgd aren't caught by 
kvm.  It doesn't hurt much as this is PV and we can relax the guest/host 
contract a little.


>
>  static int alloc_mmu_pages(struct kvm_vcpu *vcpu)
>  {
>      struct page *page;
> -    int i;
> +    int i, j;
>  
>      ASSERT(vcpu);
>  
> @@ -1227,17 +1261,22 @@ static int alloc_mmu_pages(struct kvm_vc
>          ++vcpu->kvm->n_free_mmu_pages;
>      }
>  
> -    /*
> -     * When emulating 32-bit mode, cr3 is only 32 bits even on x86_64.
> -     * Therefore we need to allocate shadow page tables in the first
> -     * 4GB of memory, which happens to fit the DMA32 zone.
> -     */
> -    page = alloc_page(GFP_KERNEL | __GFP_DMA32);
> -    if (!page)
> -        goto error_1;
> -    vcpu->mmu.pae_root = page_address(page);
> -    for (i = 0; i < 4; ++i)
> -        vcpu->mmu.pae_root[i] = INVALID_PAGE;
> +    for (j = 0; j < KVM_CR3_CACHE_SIZE; j++) {
> +        /*
> +         * When emulating 32-bit mode, cr3 is only 32 bits even on
> +         * x86_64. Therefore we need to allocate shadow page tables
> +         * in the first 4GB of memory, which happens to fit the DMA32
> +         * zone:
> +         */
> +        page = alloc_page(GFP_KERNEL | __GFP_DMA32);
> +        if (!page)
> +            goto error_1;
> +
> +        ASSERT(!vcpu->mmu.pae_root[j]);
> +        vcpu->mmu.pae_root[j] = page_address(page);
> +        for (i = 0; i < 4; ++i)
> +            vcpu->mmu.pae_root[j][i] = INVALID_PAGE;
> +    }

Since a pae root uses just 32 bytes, you can store all cache entries in 
a single page.  Not that it matters much.

>  #define ENABLE_INTERRUPTS_SYSEXIT            \
> Index: linux/include/linux/kvm.h
> ===================================================================
> --- linux.orig/include/linux/kvm.h
> +++ linux/include/linux/kvm.h
> @@ -238,4 +238,44 @@ struct kvm_dirty_log {
>  
>  #define KVM_DUMP_VCPU             _IOW(KVMIO, 250, int /* vcpu_slot */)
>  
> +
> +#define KVM_CR3_CACHE_SIZE    4
> +
> +struct kvm_cr3_cache_entry {
> +    u64 guest_cr3;
> +    u64 host_cr3;
> +};
> +
> +struct kvm_cr3_cache {
> +    struct kvm_cr3_cache_entry entry[KVM_CR3_CACHE_SIZE];
> +    u32 max_idx;
> +};
> +
> +/*
> + * Per-VCPU descriptor area shared between guest and host. Writable to
> + * both guest and host. Registered with the host by the guest when
> + * a guest acknowledges paravirtual mode.
> + */
> +struct kvm_vcpu_para_state {
> +    /*
> +     * API version information for compatibility. If there's any support
> +     * mismatch (too old host trying to execute too new guest) then
> +     * the host will deny entry into paravirtual mode. Any other
> +     * combination (new host + old guest and new host + new guest)
> +     * is supposed to work - new host versions will support all old
> +     * guest API versions.
> +     */
> +    u32 guest_version;
> +    u32 host_version;
> +    u32 size;
> +    u32 __pad_00;
> +
> +    struct kvm_cr3_cache cr3_cache;
> +
> +} __attribute__ ((aligned(PAGE_SIZE)));
> +
> +#define KVM_PARA_API_VERSION 1
> +
> +#define KVM_API_MAGIC 0x87654321
> +

<linux/kvm.h> is the vmm userspace interface.  The guest/host interface 
should probably go somewhere else.

-- 
error compiling committee.c: too many arguments to function


-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys - and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV

next prev parent reply	other threads:[~2007-01-07 12:20 UTC|newest]

Thread overview: 33+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2007-01-05 21:52 [announce] [patch] KVM paravirtualization for Linux Ingo Molnar
2007-01-05 21:52 ` Ingo Molnar
2007-01-05 22:15 ` Zachary Amsden
2007-01-05 22:15   ` Zachary Amsden
2007-01-05 22:30   ` Ingo Molnar
2007-01-05 22:30     ` Ingo Molnar
2007-01-05 22:50     ` Zachary Amsden
2007-01-05 22:50       ` Zachary Amsden
2007-01-05 23:28       ` Ingo Molnar
2007-01-05 23:02 ` [kvm-devel] " Anthony Liguori
2007-01-06 13:08 ` Pavel Machek
2007-01-06 13:08   ` Pavel Machek
2007-01-07 18:29   ` Christoph Hellwig
2007-01-07 18:29     ` Christoph Hellwig
2007-01-08 18:18   ` Christoph Lameter
2007-01-07 12:20 ` Avi Kivity [this message]
2007-01-07 12:20   ` Avi Kivity
2007-01-07 17:42   ` [kvm-devel] " Hollis Blanchard
2007-01-07 17:42     ` Hollis Blanchard
2007-01-07 17:44   ` Ingo Molnar
2007-01-07 17:44     ` Ingo Molnar
2007-01-08  8:22     ` Avi Kivity
2007-01-08  8:22       ` Avi Kivity
2007-01-08  8:39       ` Ingo Molnar
2007-01-08  8:39         ` Ingo Molnar
2007-01-08  9:08         ` Avi Kivity
2007-01-08  9:08           ` Avi Kivity
2007-01-08  9:18           ` Ingo Molnar
2007-01-08  9:18             ` Ingo Molnar
2007-01-08  9:31             ` Avi Kivity
2007-01-08  9:31               ` Avi Kivity
2007-01-08  9:43               ` Ingo Molnar
2007-01-08  9:43                 ` Ingo Molnar

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=45A0E586.50806@qumranet.com \
    --to=avi@qumranet.com \
    --cc=kvm-devel@lists.sourceforge.net \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@elte.hu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.