Re: [announce] [patch] KVM paravirtualization for Linux

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

From: Ingo Molnar <mingo@elte.hu>
To: Avi Kivity <avi@qumranet.com>
Cc: kvm-devel <kvm-devel@lists.sourceforge.net>,
	linux-kernel@vger.kernel.org
Subject: Re: [announce] [patch] KVM paravirtualization for Linux
Date: Sun, 7 Jan 2007 18:44:17 +0100	[thread overview]
Message-ID: <20070107174416.GA14607@elte.hu> (raw)
In-Reply-To: <45A0E586.50806@qumranet.com>

* Avi Kivity <avi@qumranet.com> wrote:

> >2-task context-switch performance (in microseconds, lower is better):
> >
> > native:                       1.11
> > ----------------------------------
> > Qemu:                        61.18
> > KVM upstream:                53.01
> > KVM trunk:                    6.36
> > KVM trunk+paravirt/cr3:       1.60
> >
> >i.e. 2-task context-switch performance is faster by a factor of 4, and 
> >is now quite close to native speed!
> >  
> 
> Very impressive!  The gain probably comes not only from avoiding the 
> vmentry/vmexit, but also from avoiding the flushing of the global page 
> tlb entries.

90% of the win comes from the avoidance of the VM exit. To quantify this 
more precisely i have added an artificial __flush_tlb_global() call to 
after switch_to(), just to see how much impact an extra global flush has 
on the native kernel. Context-switch cost went from 1.11 usecs to 1.65 
usecs. Then i added a __flush_tlb(), which made the cost go to 1.75, 
which means that the global flush component is at around 0.5 usecs.

> >"hackbench 1" (utilizes 40 tasks, numbers in seconds, lower is better):
> >
> > native:                       0.25
> > ----------------------------------
> > Qemu:                         7.8
> > KVM upstream:                 2.8
> > KVM trunk:                    0.55
> > KVM paravirt/cr3:             0.36
> >
> >almost twice as fast.
> >
> >"hackbench 5" (utilizes 200 tasks, numbers in seconds, lower is better):
> >
> > native:                       0.9
> > ----------------------------------
> > Qemu:                        35.2
> > KVM upstream:                 9.4
> > KVM trunk:                    2.8
> > KVM paravirt/cr3:             2.2
> >
> >still a 30% improvement - which isnt too bad considering that 200 tasks 
> >are context-switching in this workload and the cr3 cache in current CPUs 
> >is only 4 entries.
> >  
> 
> This is a little too good to be true.  Were both runs with the same 
> KVM_NUM_MMU_PAGES?

yes, both had the same elevated KVM_NUM_MMU_PAGES of 2048. The 'trunk' 
run should have been labeled as: 'cr3 tree with paravirt turned off'. 
That's not completely 'trunk' but close to it, and all other changes 
(like elimination of unnecessary TLB flushes) are fairly applied to 
both.

i also did a run with much less MMU cache pages of 256, and hackbench 1 
stayed the same, while hackbench 5 numbers started fluctuating badly (i 
think that workload if trashing the MMU cache badly).

> I'm also concerned that at this point in time the cr3 optimizations 
> will only show an improvement in microbenchmarks.  In real life 
> workloads a context switch is usually preceded by an I/O, and with the 
> current sorry state of kvm I/O the context switch time would be 
> dominated by the I/O time.

oh, i agreed completely - but in my opinion accelerating virtual I/O is 
really easy. Accelerating the context-switch path (and basic syscall 
overhead like KVM does) is /hard/. So i wanted to see whether KVM runs 
well in all the hard cases, before looking at the low hanging 
performance fruits in the I/O area =B-)

also note that there's lots of internal reasons why application 
workloads can be heavily context-switching - it's not just I/O that 
generates them. (pipes, critical sections / futexes, etc.) So having 
near-native performance for context-switches is very important.

> >+    if (irq & 8) {
> >+        outb(cached_slave_mask, PIC_SLAVE_IMR);
> >+        outb(0x60+(irq&7),PIC_SLAVE_CMD);/* 'Specific EOI' to slave */
> >+        outb(0x60+PIC_CASCADE_IR,PIC_MASTER_CMD); /* 'Specific EOI' 
> >to master-IRQ2 */
> >+    } else {
> >+        outb(cached_master_mask, PIC_MASTER_IMR);
> >+        /* 'Specific EOI' to master: */
> >+        outb(0x60+irq, PIC_MASTER_CMD);
> >+    }
> >+    spin_unlock_irqrestore(&i8259A_lock, flags);
> >+}
> 
> Any reason this can't be applied to mainline?  There's probably no 
> downside to native, and it would benefit all virtualization solutions 
> equally.

this is legacy stuff ...

> >-    u64 *pae_root;
> >+    u64 *pae_root[KVM_CR3_CACHE_SIZE];
> 
> hmm.  wouldn't it be simpler to have pae_root always point at the 
> current root?

does that guarantee that it's available? I wanted to 'pin' the root 
itself this way, to make sure that if a guest switches to it via the 
cache, that it's truly available and a valid root. cr3 addresses are 
non-virtual so this is the only mechanism available to guarantee that 
the host-side memory truly contains a root pagetable.

> >+            vcpu->mmu.pae_root[j][i] = INVALID_PAGE;
> >+        }
> >     }
> >     vcpu->mmu.root_hpa = INVALID_PAGE;
> > }
> 
> You keep the page directories pinned here. [...]

yes.

> [...]  This can be a problem if a guest frees a page directory, and 
> then starts using it as a regular page.  kvm sometimes chooses not to 
> emulate a write to a guest page table, but instead to zap it, which is 
> impossible when the page is freed.  You need to either unpin the page 
> when that happens, or add a hypercall to let kvm know when a page 
> directory is freed.

the cache is zapped upon pagefaults anyway, so unpinning ought to be 
possible. Which one would you prefer?

> >-    for (i = 0; i < 4; ++i)
> >-        vcpu->mmu.pae_root[i] = INVALID_PAGE;
> >+    for (j = 0; j < KVM_CR3_CACHE_SIZE; j++) {
> >+        /*
> >+         * When emulating 32-bit mode, cr3 is only 32 bits even on
> >+         * x86_64. Therefore we need to allocate shadow page tables
> >+         * in the first 4GB of memory, which happens to fit the DMA32
> >+         * zone:
> >+         */
> >+        page = alloc_page(GFP_KERNEL | __GFP_DMA32);
> >+        if (!page)
> >+            goto error_1;
> >+
> >+        ASSERT(!vcpu->mmu.pae_root[j]);
> >+        vcpu->mmu.pae_root[j] = page_address(page);
> >+        for (i = 0; i < 4; ++i)
> >+            vcpu->mmu.pae_root[j][i] = INVALID_PAGE;
> >+    }
> 
> Since a pae root uses just 32 bytes, you can store all cache entries 
> in a single page.  Not that it matters much.

yeah - i wanted to extend the current code in a safe way, before 
optimizing it.

> >+#define KVM_API_MAGIC 0x87654321
> >+
> 
> <linux/kvm.h> is the vmm userspace interface.  The guest/host 
> interface should probably go somewhere else.

yeah. kvm_para.h?

	Ingo

next prev parent reply	other threads:[~2007-01-07 17:47 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2007-01-05 21:52 [announce] [patch] KVM paravirtualization for Linux Ingo Molnar
2007-01-05 22:15 ` Zachary Amsden
2007-01-05 22:30   ` Ingo Molnar
2007-01-05 22:50     ` Zachary Amsden
2007-01-05 23:28       ` Ingo Molnar
2007-01-05 23:02 ` [kvm-devel] " Anthony Liguori
2007-01-06 13:08 ` Pavel Machek
2007-01-07 18:29   ` Christoph Hellwig
2007-01-08 18:18   ` Christoph Lameter
2007-01-07 12:20 ` Avi Kivity
2007-01-07 17:42   ` [kvm-devel] " Hollis Blanchard
2007-01-07 17:44   ` Ingo Molnar [this message]
2007-01-08  8:22     ` Avi Kivity
2007-01-08  8:39       ` Ingo Molnar
2007-01-08  9:08         ` Avi Kivity
2007-01-08  9:18           ` Ingo Molnar
2007-01-08  9:31             ` Avi Kivity
2007-01-08  9:43               ` Ingo Molnar

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20070107174416.GA14607@elte.hu \
    --to=mingo@elte.hu \
    --cc=avi@qumranet.com \
    --cc=kvm-devel@lists.sourceforge.net \
    --cc=linux-kernel@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox