From: Ingo Molnar <mingo@elte.hu>
To: Avi Kivity <avi@qumranet.com>
Cc: kvm-devel <kvm-devel@lists.sourceforge.net>,
linux-kernel@vger.kernel.org
Subject: Re: [announce] [patch] KVM paravirtualization for Linux
Date: Sun, 7 Jan 2007 18:44:17 +0100 [thread overview]
Message-ID: <20070107174416.GA14607@elte.hu> (raw)
In-Reply-To: <45A0E586.50806@qumranet.com>
* Avi Kivity <avi@qumranet.com> wrote:
> >2-task context-switch performance (in microseconds, lower is better):
> >
> > native: 1.11
> > ----------------------------------
> > Qemu: 61.18
> > KVM upstream: 53.01
> > KVM trunk: 6.36
> > KVM trunk+paravirt/cr3: 1.60
> >
> >i.e. 2-task context-switch performance is faster by a factor of 4, and
> >is now quite close to native speed!
> >
>
> Very impressive! The gain probably comes not only from avoiding the
> vmentry/vmexit, but also from avoiding the flushing of the global page
> tlb entries.
90% of the win comes from the avoidance of the VM exit. To quantify this
more precisely i have added an artificial __flush_tlb_global() call to
after switch_to(), just to see how much impact an extra global flush has
on the native kernel. Context-switch cost went from 1.11 usecs to 1.65
usecs. Then i added a __flush_tlb(), which made the cost go to 1.75,
which means that the global flush component is at around 0.5 usecs.
> >"hackbench 1" (utilizes 40 tasks, numbers in seconds, lower is better):
> >
> > native: 0.25
> > ----------------------------------
> > Qemu: 7.8
> > KVM upstream: 2.8
> > KVM trunk: 0.55
> > KVM paravirt/cr3: 0.36
> >
> >almost twice as fast.
> >
> >"hackbench 5" (utilizes 200 tasks, numbers in seconds, lower is better):
> >
> > native: 0.9
> > ----------------------------------
> > Qemu: 35.2
> > KVM upstream: 9.4
> > KVM trunk: 2.8
> > KVM paravirt/cr3: 2.2
> >
> >still a 30% improvement - which isnt too bad considering that 200 tasks
> >are context-switching in this workload and the cr3 cache in current CPUs
> >is only 4 entries.
> >
>
> This is a little too good to be true. Were both runs with the same
> KVM_NUM_MMU_PAGES?
yes, both had the same elevated KVM_NUM_MMU_PAGES of 2048. The 'trunk'
run should have been labeled as: 'cr3 tree with paravirt turned off'.
That's not completely 'trunk' but close to it, and all other changes
(like elimination of unnecessary TLB flushes) are fairly applied to
both.
i also did a run with much less MMU cache pages of 256, and hackbench 1
stayed the same, while hackbench 5 numbers started fluctuating badly (i
think that workload if trashing the MMU cache badly).
> I'm also concerned that at this point in time the cr3 optimizations
> will only show an improvement in microbenchmarks. In real life
> workloads a context switch is usually preceded by an I/O, and with the
> current sorry state of kvm I/O the context switch time would be
> dominated by the I/O time.
oh, i agreed completely - but in my opinion accelerating virtual I/O is
really easy. Accelerating the context-switch path (and basic syscall
overhead like KVM does) is /hard/. So i wanted to see whether KVM runs
well in all the hard cases, before looking at the low hanging
performance fruits in the I/O area =B-)
also note that there's lots of internal reasons why application
workloads can be heavily context-switching - it's not just I/O that
generates them. (pipes, critical sections / futexes, etc.) So having
near-native performance for context-switches is very important.
> >+ if (irq & 8) {
> >+ outb(cached_slave_mask, PIC_SLAVE_IMR);
> >+ outb(0x60+(irq&7),PIC_SLAVE_CMD);/* 'Specific EOI' to slave */
> >+ outb(0x60+PIC_CASCADE_IR,PIC_MASTER_CMD); /* 'Specific EOI'
> >to master-IRQ2 */
> >+ } else {
> >+ outb(cached_master_mask, PIC_MASTER_IMR);
> >+ /* 'Specific EOI' to master: */
> >+ outb(0x60+irq, PIC_MASTER_CMD);
> >+ }
> >+ spin_unlock_irqrestore(&i8259A_lock, flags);
> >+}
>
> Any reason this can't be applied to mainline? There's probably no
> downside to native, and it would benefit all virtualization solutions
> equally.
this is legacy stuff ...
> >- u64 *pae_root;
> >+ u64 *pae_root[KVM_CR3_CACHE_SIZE];
>
> hmm. wouldn't it be simpler to have pae_root always point at the
> current root?
does that guarantee that it's available? I wanted to 'pin' the root
itself this way, to make sure that if a guest switches to it via the
cache, that it's truly available and a valid root. cr3 addresses are
non-virtual so this is the only mechanism available to guarantee that
the host-side memory truly contains a root pagetable.
> >+ vcpu->mmu.pae_root[j][i] = INVALID_PAGE;
> >+ }
> > }
> > vcpu->mmu.root_hpa = INVALID_PAGE;
> > }
>
> You keep the page directories pinned here. [...]
yes.
> [...] This can be a problem if a guest frees a page directory, and
> then starts using it as a regular page. kvm sometimes chooses not to
> emulate a write to a guest page table, but instead to zap it, which is
> impossible when the page is freed. You need to either unpin the page
> when that happens, or add a hypercall to let kvm know when a page
> directory is freed.
the cache is zapped upon pagefaults anyway, so unpinning ought to be
possible. Which one would you prefer?
> >- for (i = 0; i < 4; ++i)
> >- vcpu->mmu.pae_root[i] = INVALID_PAGE;
> >+ for (j = 0; j < KVM_CR3_CACHE_SIZE; j++) {
> >+ /*
> >+ * When emulating 32-bit mode, cr3 is only 32 bits even on
> >+ * x86_64. Therefore we need to allocate shadow page tables
> >+ * in the first 4GB of memory, which happens to fit the DMA32
> >+ * zone:
> >+ */
> >+ page = alloc_page(GFP_KERNEL | __GFP_DMA32);
> >+ if (!page)
> >+ goto error_1;
> >+
> >+ ASSERT(!vcpu->mmu.pae_root[j]);
> >+ vcpu->mmu.pae_root[j] = page_address(page);
> >+ for (i = 0; i < 4; ++i)
> >+ vcpu->mmu.pae_root[j][i] = INVALID_PAGE;
> >+ }
>
> Since a pae root uses just 32 bytes, you can store all cache entries
> in a single page. Not that it matters much.
yeah - i wanted to extend the current code in a safe way, before
optimizing it.
> >+#define KVM_API_MAGIC 0x87654321
> >+
>
> <linux/kvm.h> is the vmm userspace interface. The guest/host
> interface should probably go somewhere else.
yeah. kvm_para.h?
Ingo
next prev parent reply other threads:[~2007-01-07 17:47 UTC|newest]
Thread overview: 18+ messages / expand[flat|nested] mbox.gz Atom feed top
2007-01-05 21:52 [announce] [patch] KVM paravirtualization for Linux Ingo Molnar
2007-01-05 22:15 ` Zachary Amsden
2007-01-05 22:30 ` Ingo Molnar
2007-01-05 22:50 ` Zachary Amsden
2007-01-05 23:28 ` Ingo Molnar
2007-01-05 23:02 ` [kvm-devel] " Anthony Liguori
2007-01-06 13:08 ` Pavel Machek
2007-01-07 18:29 ` Christoph Hellwig
2007-01-08 18:18 ` Christoph Lameter
2007-01-07 12:20 ` Avi Kivity
2007-01-07 17:42 ` [kvm-devel] " Hollis Blanchard
2007-01-07 17:44 ` Ingo Molnar [this message]
2007-01-08 8:22 ` Avi Kivity
2007-01-08 8:39 ` Ingo Molnar
2007-01-08 9:08 ` Avi Kivity
2007-01-08 9:18 ` Ingo Molnar
2007-01-08 9:31 ` Avi Kivity
2007-01-08 9:43 ` Ingo Molnar
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20070107174416.GA14607@elte.hu \
--to=mingo@elte.hu \
--cc=avi@qumranet.com \
--cc=kvm-devel@lists.sourceforge.net \
--cc=linux-kernel@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox