From mboxrd@z Thu Jan 1 00:00:00 1970
From: Avi Kivity
Subject: Re: [PATCH 00/13] KVM: MMU: fast page fault
Date: Thu, 29 Mar 2012 12:18:35 +0200
Message-ID: <4F7436FB.9000004@redhat.com>
References: <4F742951.7080003@linux.vnet.ibm.com>
In-Reply-To: <4F742951.7080003@linux.vnet.ibm.com>
To: Xiao Guangrong
Cc: Marcelo Tosatti, LKML, KVM
Sender: linux-kernel-owner@vger.kernel.org
List-Id: kvm.vger.kernel.org

On 03/29/2012 11:20 AM, Xiao Guangrong wrote:
> * Idea
> The present bit of the page fault error code (PFEC.P) indicates whether the
> page table is populated on all levels. If this bit is set, we know
> the page fault is caused by the page-protection bits (e.g. W/R bit) or
> the reserved bits.
>
> In KVM, in most cases, this kind of page fault (PFEC.P = 1) can be
> simply fixed: the page fault caused by reserved bits
> (PFEC.P = 1 && PFEC.RSV = 1) has already been filtered out in the fast mmio
> path. What we need to do to fix the remaining page faults
> (PFEC.P = 1 && RSV != 1) is just to increase the corresponding access on
> the spte.
>
> This patchset introduces a fast path to fix this kind of page fault: it
> runs outside mmu-lock and need not walk the host page table to get the
> mapping from gfn to pfn.

Wow!  Looks like interesting times are back in mmu-land.

Comments below are before review of actual patches, so maybe they're
already answered there, or maybe they're just nonsense.

> * Advantage
> - it is really fast
>   it fixes the page fault outside mmu-lock, and uses a very light way to
>   avoid races with other paths. Also, it fixes the page fault before
>   gfn_to_pfn, which means no host page table walking.
>
> - we can get lots of page faults with PFEC.P = 1 in KVM:
>   - in the case of ept/npt
>     after the shadow pages become stable (all gfns are mapped in the shadow
>     page table; this is a short stage since only one shadow page table is
>     used and only a few pages are needed), almost all page faults are caused
>     by write-protect (frame-buffer under Xwindow, migration); the other
>     small part is caused by page merge/COW under KSM/THP.
>
>     We do not expect it to fix the page faults caused by the read-only host
>     pages of KSM, since after COW, all the sptes pointing to the gfn will be
>     unmapped.
>
>   - in the case of soft mmu
>     - many spurious page faults due to the tlb being flushed lazily
>     - lots of write-protect page faults (dirty bit tracking for guest ptes,
>       shadow page table write-protected, frame-buffer under Xwindow,
>       migration, ...)
>
>
> * Implementation
> We can freely walk the shadow pages between walk_shadow_page_lockless_begin
> and walk_shadow_page_lockless_end, which ensures all the shadow pages are
> valid.
>
> In most cases, cmpxchg is enough to change the access bits of the spte,
> but the write-protect path on softmmu/nested mmu is a special case: it is
> a read-check-modify path: read spte, check W bit, then clear W bit.

We also set gpte.D and gpte.A, no?  How do you handle that?

> In order
> to avoid marking the spte writable after/during page write-protect, we do
> the trick below:
>
>       fast page fault path:
>             lock RCU
>             set identification in the spte

What if you can't (already taken)?  Spin?  Slow path?

>             smp_mb()
>             if (!rmap.PTE_LIST_WRITE_PROTECT)
>                  cmpxchg + w - vcpu-id
>             unlock RCU
>
>       write protect path:
>             lock mmu-lock
>             set rmap.PTE_LIST_WRITE_PROTECT
>             smp_mb()
>             if (spte.w || spte has identification)
>                  clear w bit and identification
>             unlock mmu-lock
>
> Setting the identification in the spte notifies the page-protect path that
> it must modify the spte, so that we can see the change in the cmpxchg.
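
To check my reading of the two paths quoted above, here is a rough
user-space sketch with C11 atomics. It is not kernel code and not taken
from the patches: the bit layout (W in bit 1, an 8-bit identification
field starting at bit 53), struct demo_page, and every function name are
made up for illustration. Treating an already-tagged spte as "give up and
take the slow path" is my assumption, and RCU and mmu-lock are only
mentioned in comments.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout, chosen only for this sketch. */
#define SPTE_W        (1ull << 1)              /* writable bit */
#define SPTE_ID_SHIFT 53                       /* assumed software-available bits */
#define SPTE_ID_MASK  (0xffull << SPTE_ID_SHIFT)

struct demo_page {
    _Atomic uint64_t spte;         /* a single spte mapping the gfn */
    atomic_bool write_protected;   /* stands in for rmap.PTE_LIST_WRITE_PROTECT */
};

static void demo_page_init(struct demo_page *p, uint64_t spte)
{
    atomic_init(&p->spte, spte);
    atomic_init(&p->write_protected, false);
}

/* Fast path, step 1 (would run under RCU): tag the spte with our id. */
static bool fast_pf_mark(struct demo_page *p, unsigned vcpu_id, uint64_t *tagged)
{
    assert(vcpu_id >= 1 && vcpu_id <= 0xff);   /* 0 means "no identification" */
    uint64_t old = atomic_load(&p->spte);
    if (old & SPTE_ID_MASK)
        return false;              /* already tagged: assumed slow-path fallback */
    *tagged = old | ((uint64_t)vcpu_id << SPTE_ID_SHIFT);
    return atomic_compare_exchange_strong(&p->spte, &old, *tagged);
}

/* Fast path, step 2: barrier, check the rmap flag, then cmpxchg +W -id. */
static bool fast_pf_commit(struct demo_page *p, uint64_t tagged)
{
    atomic_thread_fence(memory_order_seq_cst);         /* smp_mb() */
    if (atomic_load(&p->write_protected))
        return false;              /* write-protect in progress: do nothing */
    uint64_t want = (tagged & ~SPTE_ID_MASK) | SPTE_W;
    /* Fails if the write-protect path cleared our identification meanwhile. */
    return atomic_compare_exchange_strong(&p->spte, &tagged, want);
}

/* Write-protect path (would run under mmu-lock). */
static void write_protect(struct demo_page *p)
{
    atomic_store(&p->write_protected, true);
    atomic_thread_fence(memory_order_seq_cst);         /* smp_mb() */
    uint64_t spte = atomic_load(&p->spte);
    if (spte & (SPTE_W | SPTE_ID_MASK))
        atomic_fetch_and(&p->spte, ~(SPTE_W | SPTE_ID_MASK));
}
```

With no concurrent write-protect, fast_pf_mark() followed by
fast_pf_commit() leaves the spte writable with the identification removed.
If write_protect() runs between the two steps, the commit is refused and
the spte stays read-only, which is the property the trick is after.
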
>
> Setting the identification is also a trick: it only sets the last bits of
> the spte, which does not change the mapping or lose cpu status bits.

There are plenty of available bits, 53-62.

>
> The identification should be unique to avoid the race below:
>
>      VCPU 0                VCPU 1            VCPU 2
>       lock RCU
>    spte + identification
>    check conditions
>                        do write-protect, clear
>                           identification
>                                               lock RCU
>                                         set identification
>      cmpxchg + w - identification
>         OOPS!!!

Is it not sufficient to use just two bits?

pf_lock - taken by page fault path
wp_lock - taken by write protect path

pf cmpxchg checks both bits.

> We choose the vcpu id as the unique value; currently, 254 vcpus on VMX
> and 127 vcpus on softmmu can be fast. Keep it simple at first. :)
>
>
> * Performance
> It introduces a full memory barrier on the page write-protect path. I
> have run kernbench in text mode, which does not generate write-protect
> page faults from the frame-buffer and so avoids the optimization
> introduced by this patch; it shows no regression.
>
> And here are the results from x11perf and migration on autotest:
>
> x11perf (x11perf -repeat 10 -comppixwin500):
> (Host: Intel(R) Core(TM) i5-2540M CPU @ 2.60GHz * 4 + 4G
>  Guest: 4 vcpus + 1G)
>
> - For ept:
> $ x11perfcomp baseline-hard optimaze-hard
> 1: baseline-hard
> 2: optimaze-hard
>
>      1         2    Operation
> --------  --------  ---------
>   7060.0    7150.0  Composite 500x500 from pixmap to window
>
> - For shadow mmu:
> $ x11perfcomp baseline-soft optimaze-soft
> 1: baseline-soft
> 2: optimaze-soft
>
>      1         2    Operation
> --------  --------  ---------
>   6980.0    7490.0  Composite 500x500 from pixmap to window
>
> ( It is interesting that after this patch, the performance of x11perf on
>   softmmu is better than on hardmmu; I have tested it many times,
>   it is really true. :) )

It could be because you cannot use THP with dirty logging, so you pay
the overhead of TDP.
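
For scale, the relative gains in the two x11perf tables can be worked out
with a trivial helper. This is only a sketch; gain_percent is a name I made
up, and the inputs are the numbers quoted above.

```c
#include <assert.h>

/* Relative change from `before` to `after`, in percent. */
static double gain_percent(double before, double after)
{
    assert(before > 0.0);          /* a rate of zero has no meaningful ratio */
    return (after - before) / before * 100.0;
}
```

gain_percent(7060.0, 7150.0) comes to about 1.3 for ept, and
gain_percent(6980.0, 7490.0) to about 7.3 for shadow mmu.
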

> autotest migration:
> (Host: Intel(R) Xeon(R) CPU X5690 @ 3.47GHz * 12 + 32G)
>
> - For ept:
>
> Before:
>                     smp2.Fedora.16.64.migrate
> Times   .unix      .with_autotest.dbench.unix     total
>  1       102           204                         309
>  2       68            203                         275
>  3       67            218                         289
>
> After:
>                     smp2.Fedora.16.64.migrate
> Times   .unix      .with_autotest.dbench.unix     total
>  1       103           189                         295
>  2       67            188                         259
>  3       64            202                         271
>
>
> - For shadow mmu:
>
> Before:
>                     smp2.Fedora.16.64.migrate
> Times   .unix      .with_autotest.dbench.unix     total
>  1       102           262                         368
>  2       68            220                         292
>  3       68            234                         307
>
> After:
>                     smp2.Fedora.16.64.migrate
> Times   .unix      .with_autotest.dbench.unix     total
>  1       104           231                         341
>  2       68            218                         289
>  3       66            205                         275
>
>
> Any comments are welcome. :)
>

Very impressive.  Now to review the patches (will take me some time).

-- 
error compiling committee.c: too many arguments to function