From: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
To: Avi Kivity <avi@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>,
LKML <linux-kernel@vger.kernel.org>, KVM <kvm@vger.kernel.org>
Subject: [PATCH 00/13] KVM: MMU: fast page fault
Date: Thu, 29 Mar 2012 17:20:17 +0800 [thread overview]
Message-ID: <4F742951.7080003@linux.vnet.ibm.com> (raw)
* Idea
The present bit of page fault error code (EFEC.P) indicates whether the
page table is populated on all levels, if this bit is set, we can know
the page fault is caused by the page-protection bits (e.g. W/R bit) or
the reserved bits.
In KVM, in most cases, all this kind of page fault (EFEC.P = 1) can be
simply fixed: the page fault caused by reserved bit
(EFFC.P = 1 && EFEC.RSV = 1) has already been filtered out in fast mmio
path. What we need do to fix the rest page fault (EFEC.P = 1 && RSV != 1)
is just increasing the corresponding access on the spte.
This pachset introduces a fast path to fix this kind of page fault: it
is out of mmu-lock and need not walk host page table to get the mapping
from gfn to pfn.
* Advantage
- it is really fast
it fixes page fault out of mmu-lock, and uses a very light way to avoid
the race with other pathes. Also, it fixes page fault in the front of
gfn_to_pfn, it means no host page table walking.
- we can get lots of page fault with PFEC.P = 1 in KVM:
- in the case of ept/npt
after shadow page become stable (all gfn is mapped in shadow page table,
it is a short stage since only one shadow page table is used and only a
few of page is needed), almost all page fault is caused by write-protect
(frame-buffer under Xwindow, migration), the other small part is caused
by page merge/COW under KSM/THP.
We do not hope it can fix the page fault caused by the read-only host
page of KSM, since after COW, all the spte pointing to the gfn will be
unmapped.
- in the case of soft mmu
- many spurious page fault due to tlb lazily flushed
- lots of write-protect page fault (dirty bit track for guest pte, shadow
page table write-protected, frame-buffer under Xwindow, migration, ...)
* Implementation
We can freely walk the page between walk_shadow_page_lockless_begin and
walk_shadow_page_lockless_end, it can ensure all the shadow page is valid.
In the most case, cmpxchg is fair enough to change the access bit of spte,
but the write-protect path on softmmu/nested mmu is a especial case: it is
a read-check-modify path: read spte, check W bit, then clear W bit. In order
to avoid marking spte writable after/during page write-protect, we do the
trick like below:
fast page fault path:
lock RCU
set identification in the spte
smp_mb()
if (!rmap.PTE_LIST_WRITE_PROTECT)
cmpxchg + w - vcpu-id
unlock RCU
write protect path:
lock mmu-lock
set rmap.PTE_LIST_WRITE_PROTECT
smp_mb()
if (spte.w || spte has identification)
clear w bit and identification
unlock mmu-lock
Setting identification in the spte is used to notify page-protect path to
modify the spte, then we can see the change in the cmpxchg.
Setting identification is also a trick: it only set the last bit of spte
that does not change the mapping and lose cpu status bits.
The identification should be unique to avoid the below race:
VCPU 0 VCPU 1 VCPU 2
lock RCU
spte + identification
check conditions
do write-protect, clear
identification
lock RCU
set identification
cmpxchg + w - identification
OOPS!!!
We choose the vcpu id as the unique value, currently, 254 vcpus on VMX
and 127 vcpus on softmmu can be fast. Keep it simply firtsly. :)
* Performance
It introduces a full memory barrier on the page write-protect path, i
have done the test of kernbench in the text mode which does not generate
write-protect page fault by frame-buffer avoiding the optimization
introduced by this patch, it shows no regression.
And there is the result tested by x11perf and migration on autotest:
x11perf (x11perf -repeat 10 -comppixwin500):
(Host: Intel(R) Core(TM) i5-2540M CPU @ 2.60GHz * 4 + 4G
Guest: 4 vcpus + 1G)
- For ept:
$ x11perfcomp baseline-hard optimaze-hard
1: baseline-hard
2: optimaze-hard
1 2 Operation
-------- -------- ---------
7060.0 7150.0 Composite 500x500 from pixmap to window
- For shadow mmu:
$ x11perfcomp baseline-soft optimaze-soft
1: baseline-soft
2: optimaze-soft
1 2 Operation
-------- -------- ---------
6980.0 7490.0 Composite 500x500 from pixmap to window
( It is interesting that after this patch, the performance of x11perf on
softmmu is better than it on hardmmu, i have tested it for many times,
it is really true. :) )
autotest migration:
(Host: Intel(R) Xeon(R) CPU X5690 @ 3.47GHz * 12 + 32G)
- For ept:
Before:
smp2.Fedora.16.64.migrate
Times .unix .with_autotest.dbench.unix total
1 102 204 309
2 68 203 275
3 67 218 289
After:
smp2.Fedora.16.64.migrate
Times .unix .with_autotest.dbench.unix total
1 103 189 295
2 67 188 259
3 64 202 271
- For shadow mmu:
Before:
smp2.Fedora.16.64.migrate
Times .unix .with_autotest.dbench.unix total
1 102 262 368
2 68 220 292
3 68 234 307
After:
smp2.Fedora.16.64.migrate
Times .unix .with_autotest.dbench.unix total
1 104 231 341
2 68 218 289
3 66 205 275
Any comments are welcome. :)
next reply other threads:[~2012-03-29 9:20 UTC|newest]
Thread overview: 83+ messages / expand[flat|nested] mbox.gz Atom feed top
2012-03-29 9:20 Xiao Guangrong [this message]
2012-03-29 9:20 ` [PATCH 01/13] KVM: MMU: properly assert spte on rmap_next path Xiao Guangrong
2012-03-29 9:21 ` [PATCH 02/13] KVM: MMU: abstract spte write-protect Xiao Guangrong
2012-03-29 11:11 ` Avi Kivity
2012-03-29 11:51 ` Xiao Guangrong
2012-03-29 9:22 ` [PATCH 03/13] KVM: MMU: split FNAME(prefetch_invalid_gpte) Xiao Guangrong
2012-03-29 13:00 ` Avi Kivity
2012-03-30 3:51 ` Xiao Guangrong
2012-03-29 9:22 ` [PATCH 04/13] KVM: MMU: introduce FNAME(get_sp_gpa) Xiao Guangrong
2012-03-29 13:07 ` Avi Kivity
2012-03-30 5:01 ` Xiao Guangrong
2012-04-01 12:42 ` Avi Kivity
2012-03-29 9:23 ` [PATCH 05/13] KVM: MMU: reset shadow_mmio_mask Xiao Guangrong
2012-03-29 13:10 ` Avi Kivity
2012-03-29 15:28 ` Avi Kivity
2012-03-29 16:24 ` Avi Kivity
2012-03-29 9:23 ` [PATCH 06/13] KVM: VMX: export PFEC.P bit on ept Xiao Guangrong
2012-03-29 9:24 ` [PATCH 07/13] KVM: MMU: store more bits in rmap Xiao Guangrong
2012-03-29 9:25 ` [PATCH 08/13] KVM: MMU: fask check whether page is writable Xiao Guangrong
2012-03-29 15:49 ` Avi Kivity
2012-03-30 5:10 ` Xiao Guangrong
2012-04-01 15:52 ` Avi Kivity
2012-04-05 17:54 ` Xiao Guangrong
2012-04-12 23:08 ` Marcelo Tosatti
2012-04-13 10:26 ` Xiao Guangrong
2012-03-29 9:25 ` [PATCH 09/13] KVM: MMU: get expected spte out of mmu-lock Xiao Guangrong
2012-04-01 15:53 ` Avi Kivity
2012-04-05 18:25 ` Xiao Guangrong
2012-04-09 12:28 ` Avi Kivity
2012-04-09 13:16 ` Takuya Yoshikawa
2012-04-09 13:21 ` Avi Kivity
2012-03-29 9:26 ` [PATCH 10/13] KVM: MMU: store vcpu id in spte to notify page write-protect path Xiao Guangrong
2012-03-29 9:27 ` [PATCH 11/13] KVM: MMU: fast path of handling guest page fault Xiao Guangrong
2012-03-31 12:24 ` Xiao Guangrong
2012-04-01 16:23 ` Avi Kivity
2012-04-03 13:04 ` Avi Kivity
2012-04-05 19:39 ` Xiao Guangrong
2012-03-29 9:27 ` [PATCH 12/13] KVM: MMU: trace fast " Xiao Guangrong
2012-03-29 9:28 ` [PATCH 13/13] KVM: MMU: fix kvm_mmu_pagetable_walk tracepoint Xiao Guangrong
2012-03-29 10:18 ` [PATCH 00/13] KVM: MMU: fast page fault Avi Kivity
2012-03-29 11:40 ` Xiao Guangrong
2012-03-29 12:57 ` Avi Kivity
2012-03-30 9:18 ` Xiao Guangrong
2012-03-31 13:12 ` Xiao Guangrong
2012-04-01 12:58 ` Avi Kivity
2012-04-05 21:57 ` Xiao Guangrong
2012-04-06 5:24 ` Xiao Guangrong
2012-04-09 13:20 ` Avi Kivity
2012-04-09 13:59 ` Xiao Guangrong
2012-04-09 13:12 ` Avi Kivity
2012-04-09 13:55 ` Xiao Guangrong
2012-04-09 14:01 ` Xiao Guangrong
2012-04-09 14:25 ` Avi Kivity
2012-04-09 17:58 ` Marcelo Tosatti
2012-04-09 18:13 ` Xiao Guangrong
2012-04-09 19:31 ` Marcelo Tosatti
2012-04-09 18:26 ` Xiao Guangrong
2012-04-09 19:46 ` Marcelo Tosatti
2012-04-10 3:06 ` Xiao Guangrong
2012-04-10 10:04 ` Avi Kivity
2012-04-11 1:47 ` Marcelo Tosatti
2012-04-11 9:15 ` Avi Kivity
2012-04-10 10:39 ` Avi Kivity
2012-04-10 11:40 ` Takuya Yoshikawa
2012-04-10 11:58 ` Xiao Guangrong
2012-04-11 12:15 ` Takuya Yoshikawa
2012-04-11 12:38 ` Xiao Guangrong
2012-04-11 14:14 ` Takuya Yoshikawa
2012-04-11 14:21 ` Avi Kivity
2012-04-11 22:26 ` Takuya Yoshikawa
2012-04-13 14:25 ` Takuya Yoshikawa
2012-04-15 9:32 ` Avi Kivity
2012-04-16 15:49 ` Takuya Yoshikawa
2012-04-16 16:02 ` Avi Kivity
2012-04-17 6:26 ` Xiao Guangrong
2012-04-17 7:51 ` Avi Kivity
2012-04-17 12:37 ` Takuya Yoshikawa
2012-04-17 12:41 ` Avi Kivity
2012-04-17 14:54 ` Takuya Yoshikawa
2012-04-17 14:56 ` Avi Kivity
2012-04-18 13:42 ` Takuya Yoshikawa
2012-04-17 6:16 ` Xiao Guangrong
2012-04-10 10:10 ` Avi Kivity
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=4F742951.7080003@linux.vnet.ibm.com \
--to=xiaoguangrong@linux.vnet.ibm.com \
--cc=avi@redhat.com \
--cc=kvm@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=mtosatti@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).