From: Avi Kivity <avi@redhat.com>
To: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>,
LKML <linux-kernel@vger.kernel.org>, KVM <kvm@vger.kernel.org>
Subject: Re: [PATCH 00/13] KVM: MMU: fast page fault
Date: Thu, 29 Mar 2012 12:18:35 +0200
Message-ID: <4F7436FB.9000004@redhat.com>
In-Reply-To: <4F742951.7080003@linux.vnet.ibm.com>
On 03/29/2012 11:20 AM, Xiao Guangrong wrote:
> * Idea
> The present bit of the page fault error code (PFEC.P) indicates whether
> the page table is populated on all levels. If this bit is set, we know
> the page fault was caused by the page-protection bits (e.g. the W/R bit)
> or by the reserved bits.
>
> In KVM, most page faults of this kind (PFEC.P = 1) can be fixed simply:
> page faults caused by a reserved bit (PFEC.P = 1 && PFEC.RSV = 1) have
> already been filtered out in the fast mmio path. All we need to do to fix
> the remaining page faults (PFEC.P = 1 && RSV != 1) is to add the
> corresponding access to the spte.
>
> This patchset introduces a fast path to fix this kind of page fault: it
> runs outside mmu-lock and need not walk the host page table to get the
> mapping from gfn to pfn.
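
The check described in the quoted idea can be sketched in C roughly as follows. The bit positions are the architectural x86 page-fault error-code bits; the macro names and the helper `fast_path_candidate` are hypothetical and are not taken from the patches.

```c
#include <stdbool.h>
#include <stdint.h>

/* Architectural x86 page-fault error code bits (macro names are hypothetical). */
#define PFEC_P    (1u << 0)  /* fault hit a present translation */
#define PFEC_W    (1u << 1)  /* faulting access was a write */
#define PFEC_RSVD (1u << 3)  /* reserved bit set in a paging entry */

/*
 * Hypothetical helper: true when the fault is a protection fault on a
 * fully populated mapping (P = 1) that was not caused by a reserved bit
 * (RSVD = 0). This is the class of fault the quoted text wants to fix
 * outside mmu-lock; reserved-bit faults go to the mmio path instead.
 */
static inline bool fast_path_candidate(uint32_t pfec)
{
    return (pfec & PFEC_P) && !(pfec & PFEC_RSVD);
}
```

A write fault on a present, write-protected mapping (`PFEC_P | PFEC_W`) qualifies; a not-present fault or a reserved-bit fault does not.
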
Wow!
Looks like interesting times are back in mmu-land.
Comments below are before review of actual patches, so maybe they're
already answered there, or maybe they're just nonsense.
> * Advantage
> - it is really fast
>   it fixes page faults outside mmu-lock and uses a very lightweight way
>   to avoid races with other paths. Also, it fixes the page fault before
>   gfn_to_pfn, which means no host page table walk.
>
> - we can get lots of page faults with PFEC.P = 1 in KVM:
>   - in the case of ept/npt
>     after the shadow pages become stable (all gfns are mapped in the
>     shadow page table; this is a short stage since only one shadow page
>     table is used and only a few pages are needed), almost all page
>     faults are caused by write-protection (frame-buffer under X Window,
>     migration); the remaining small part is caused by page merge/COW
>     under KSM/THP.
>
>     We do not expect it to fix the page faults caused by read-only host
>     pages of KSM, since after COW all the sptes pointing to the gfn will
>     be unmapped.
>
>   - in the case of soft mmu
>     - many spurious page faults due to lazily flushed TLBs
>     - lots of write-protect page faults (dirty bit tracking for guest
>       ptes, write-protected shadow page tables, frame-buffer under
>       X Window, migration, ...)
>
>
> * Implementation
> We can freely walk the shadow page tables between
> walk_shadow_page_lockless_begin and walk_shadow_page_lockless_end; this
> ensures that all the shadow pages are valid.
>
> In most cases, cmpxchg is enough to change the access bits of the spte,
> but the write-protect path on softmmu/nested mmu is a special case: it is
> a read-check-modify path: read spte, check W bit, then clear W bit.
We also set gpte.D and gpte.A, no? How do you handle that?
> In order
> to avoid marking the spte writable during or after page write-protect,
> we use the trick below:
>
> fast page fault path:
>     lock RCU
>     set identification in the spte
What if you can't (already taken)? Spin? Slow path?
>     smp_mb()
>     if (!rmap.PTE_LIST_WRITE_PROTECT)
>         cmpxchg + w - vcpu-id
>     unlock RCU
>
> write protect path:
>     lock mmu-lock
>     set rmap.PTE_LIST_WRITE_PROTECT
>     smp_mb()
>     if (spte.w || spte has identification)
>         clear w bit and identification
>     unlock mmu-lock
>
> Setting the identification in the spte makes the page-protect path
> modify the spte, so the fast path's cmpxchg sees the change and fails.
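
The two quoted paths can be sketched with C11 atomics roughly as below. This is only an illustration under stated assumptions: the bit layout (W at bit 1, an 8-bit identification at bits 53-60), `struct gfn_state` and both function names are hypothetical, `atomic_thread_fence` stands in for `smp_mb()`, and RCU and mmu-lock are left to the caller. The quoted pseudocode does not say what happens when an identification is already present or when the rmap flag is found set; the sketch assumes a fall back to the slow path in both cases.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout: W is bit 1, identification occupies bits 53-60. */
#define SPTE_W        (UINT64_C(1) << 1)
#define SPTE_ID_SHIFT 53
#define SPTE_ID_MASK  (UINT64_C(0xff) << SPTE_ID_SHIFT)

struct gfn_state {
    _Atomic uint64_t spte;        /* shadow page table entry */
    atomic_bool write_protected;  /* stands in for rmap.PTE_LIST_WRITE_PROTECT */
};

/* Fast page fault path. id must be nonzero; caller holds the RCU read lock. */
static inline bool fast_fix_write_fault(struct gfn_state *g, unsigned int id)
{
    uint64_t old = atomic_load(&g->spte);
    uint64_t marked, expect;

    if (old & SPTE_ID_MASK)
        return false;  /* assumption: already marked, take the slow path */

    /* Set identification in the spte. */
    marked = old | ((uint64_t)id << SPTE_ID_SHIFT);
    if (!atomic_compare_exchange_strong(&g->spte, &old, marked))
        return false;

    atomic_thread_fence(memory_order_seq_cst);  /* smp_mb() */

    expect = marked;
    if (atomic_load(&g->write_protected)) {
        /* Assumption: retract our mark (if still there) and give up. */
        atomic_compare_exchange_strong(&g->spte, &expect, old);
        return false;
    }

    /* "cmpxchg + w - vcpu-id": succeeds only if our mark is still intact. */
    return atomic_compare_exchange_strong(&g->spte, &expect, old | SPTE_W);
}

/* Write-protect path. Caller holds mmu-lock. */
static inline void write_protect(struct gfn_state *g)
{
    atomic_store(&g->write_protected, true);
    atomic_thread_fence(memory_order_seq_cst);  /* smp_mb() */

    if (atomic_load(&g->spte) & (SPTE_W | SPTE_ID_MASK))
        atomic_fetch_and(&g->spte, ~(SPTE_W | SPTE_ID_MASK));
}
```

The two barriers pair up: either the fast path observes the rmap flag and backs off, or the write-protect path observes the identification and clears it, which makes the final cmpxchg fail.
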
>
> Setting the identification is also a trick: it only sets the last bits
> of the spte, which neither changes the mapping nor loses cpu status bits.
There are plenty of available bits, 53-62.
>
> The identification should be unique to avoid the below race:
>
>      VCPU 0                   VCPU 1                   VCPU 2
> lock RCU
> spte + identification
> check conditions
>                          do write-protect, clear
>                          identification
>                                                   lock RCU
>                                                   set identification
> cmpxchg + w - identification
>     OOPS!!!
Is it not sufficient to use just two bits?
pf_lock - taken by page fault path
wp_lock - taken by write protect path
pf cmpxchg checks both bits.
> We choose the vcpu id as the unique value; currently, 254 vcpus on VMX
> and 127 vcpus on softmmu can be fast. Keep it simple at first. :)
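
The race in the quoted table can be replayed deterministically on a single thread to show why the identification has to be unique. The sketch below is only an illustration: the bit layout, `with_id` and `replay_race` are hypothetical, and the three "VCPUs" are just consecutive steps in one function.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout: W is bit 1, identification occupies bits 53-60. */
#define SPTE_W        (UINT64_C(1) << 1)
#define SPTE_ID_SHIFT 53
#define SPTE_ID_MASK  (UINT64_C(0xff) << SPTE_ID_SHIFT)

static inline uint64_t with_id(uint64_t spte, unsigned int id)
{
    return (spte & ~SPTE_ID_MASK) | ((uint64_t)id << SPTE_ID_SHIFT);
}

/*
 * Replays the interleaving from the quoted table. id0 and id2 are the
 * identifications used by VCPU 0 and VCPU 2. Returns true if VCPU 0's
 * final cmpxchg wrongly succeeds after the write-protect ("OOPS").
 */
static inline bool replay_race(unsigned int id0, unsigned int id2)
{
    _Atomic uint64_t spte;
    uint64_t base = UINT64_C(0x1001);  /* arbitrary read-only spte value */
    uint64_t expect;

    atomic_init(&spte, base);

    /* VCPU 0: set identification, check conditions (they pass). */
    atomic_store(&spte, with_id(base, id0));

    /* VCPU 1: write-protect, clearing W and the identification. */
    atomic_fetch_and(&spte, ~(SPTE_W | SPTE_ID_MASK));

    /* VCPU 2: starts its own fast path and sets its identification. */
    atomic_store(&spte, with_id(base, id2));

    /* VCPU 0: cmpxchg + W - identification, expecting its own mark. */
    expect = with_id(base, id0);
    return atomic_compare_exchange_strong(&spte, &expect, base | SPTE_W);
}
```

With a shared marker (`replay_race(1, 1)`) VCPU 0 cannot tell VCPU 2's mark from its own, so the cmpxchg succeeds and the write-protected spte becomes writable. With per-vcpu values (`replay_race(1, 2)`) the cmpxchg fails as intended.
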
>
>
> * Performance
> It introduces a full memory barrier on the page write-protect path. I
> have run kernbench in text mode, which does not generate write-protect
> page faults from the frame-buffer and so avoids the optimization
> introduced by this patch; it shows no regression.
>
> Here are the results from x11perf and from migration under autotest:
>
> x11perf (x11perf -repeat 10 -comppixwin500):
> (Host: Intel(R) Core(TM) i5-2540M CPU @ 2.60GHz * 4 + 4G
> Guest: 4 vcpus + 1G)
>
> - For ept:
> $ x11perfcomp baseline-hard optimaze-hard
> 1: baseline-hard
> 2: optimaze-hard
>
>     1         2      Operation
> --------  --------   ---------
>   7060.0    7150.0   Composite 500x500 from pixmap to window
>
> - For shadow mmu:
> $ x11perfcomp baseline-soft optimaze-soft
> 1: baseline-soft
> 2: optimaze-soft
>
>     1         2      Operation
> --------  --------   ---------
>   6980.0    7490.0   Composite 500x500 from pixmap to window
>
> (Interestingly, after this patch x11perf performs better on softmmu
> than on hardmmu. I have tested it many times; it really is true. :) )
It could be because you cannot use THP with dirty logging, so you pay
the overhead of TDP.
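
For reference, the relative change behind the x11perf figures quoted above can be computed with a small helper. The function name `percent_change` is hypothetical, and the sketch assumes the x11perfcomp columns are rates where higher is better.

```c
/* Relative change from baseline to patched, in percent. */
static inline double percent_change(double baseline, double patched)
{
    return (patched - baseline) / baseline * 100.0;
}
```

Under that assumption, `percent_change(7060.0, 7150.0)` is about 1.27 for ept and `percent_change(6980.0, 7490.0)` is about 7.31 for shadow mmu.
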
> autotest migration:
> (Host: Intel(R) Xeon(R) CPU X5690 @ 3.47GHz * 12 + 32G)
>
> - For ept:
>
> Before:
> smp2.Fedora.16.64.migrate
> Times   .unix   .with_autotest.dbench.unix   total
>   1       102              204                 309
>   2        68              203                 275
>   3        67              218                 289
>
> After:
> smp2.Fedora.16.64.migrate
> Times   .unix   .with_autotest.dbench.unix   total
>   1       103              189                 295
>   2        67              188                 259
>   3        64              202                 271
>
>
> - For shadow mmu:
>
> Before:
> smp2.Fedora.16.64.migrate
> Times   .unix   .with_autotest.dbench.unix   total
>   1       102              262                 368
>   2        68              220                 292
>   3        68              234                 307
>
> After:
> smp2.Fedora.16.64.migrate
> Times   .unix   .with_autotest.dbench.unix   total
>   1       104              231                 341
>   2        68              218                 289
>   3        66              205                 275
>
>
> Any comments are welcome. :)
>
Very impressive. Now to review the patches (will take me some time).
--
error compiling committee.c: too many arguments to function