From: Avi Kivity <avi@redhat.com>
To: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>,
	LKML <linux-kernel@vger.kernel.org>, KVM <kvm@vger.kernel.org>
Subject: Re: [PATCH 00/13] KVM: MMU: fast page fault
Date: Thu, 29 Mar 2012 12:18:35 +0200
Message-ID: <4F7436FB.9000004@redhat.com>
In-Reply-To: <4F742951.7080003@linux.vnet.ibm.com>

On 03/29/2012 11:20 AM, Xiao Guangrong wrote:
> * Idea
> The present bit of the page fault error code (PFEC.P) indicates whether
> the page table is populated on all levels. If this bit is set, we know
> the page fault was caused by the page-protection bits (e.g. the W/R bit)
> or by the reserved bits.
>
> In KVM, in most cases, this kind of page fault (PFEC.P = 1) can be
> fixed simply: the page fault caused by a reserved bit
> (PFEC.P = 1 && PFEC.RSV = 1) has already been filtered out in the fast
> mmio path. All we need to do to fix the remaining page faults
> (PFEC.P = 1 && PFEC.RSV != 1) is to increase the corresponding access
> on the spte.
>
> This patchset introduces a fast path to fix this kind of page fault: it
> runs outside mmu-lock and need not walk the host page table to get the
> mapping from gfn to pfn.
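
A minimal sketch of the filter described above, in C. The bit positions
are the architectural x86 page-fault error code bits; the macro and
function names (PF_ERR_*, fast_pf_candidate) are made up for illustration
and are not taken from the patches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* x86 page-fault error code bits (architectural positions). */
#define PF_ERR_PRESENT (1u << 0) /* P: the translation was present */
#define PF_ERR_WRITE   (1u << 1) /* W/R: the access was a write */
#define PF_ERR_RSVD    (1u << 3) /* RSVD: a reserved bit was set in an entry */

static_assert(PF_ERR_PRESENT == 1u, "P is bit 0 of the error code");

/*
 * Hypothetical helper: returns true when a fault is a candidate for the
 * lockless fast path, i.e. P = 1 and RSVD = 0.
 */
bool fast_pf_candidate(uint32_t error_code)
{
    if (!(error_code & PF_ERR_PRESENT))
        return false; /* not-present fault: the mapping must be built */
    if (error_code & PF_ERR_RSVD)
        return false; /* reserved-bit fault: handled by the mmio path */
    return true;      /* protection fault on a fully populated walk */
}
```

A write fault with P = 1 and RSVD = 0 passes the filter; anything else
would be left to the existing paths.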

Wow!

Looks like interesting times are back in mmu-land.

Comments below are before review of actual patches, so maybe they're
already answered there, or maybe they're just nonsense.

> * Advantage
> - it is really fast
>   it fixes the page fault outside mmu-lock, and uses a very light way to
>   avoid racing with other paths. Also, it fixes the page fault before
>   gfn_to_pfn, which means no host page table walk.
>
> - we can get lots of page faults with PFEC.P = 1 in KVM:
>   - in the case of ept/npt
>     after the shadow pages become stable (all gfns are mapped in the
>     shadow page table; this is a short stage since only one shadow page
>     table is used and only a few pages are needed), almost all page
>     faults are caused by write-protection (frame-buffer under Xwindow,
>     migration); the other small part is caused by page merge/COW under
>     KSM/THP.
>
>     We do not expect it to fix the page faults caused by the read-only
>     host pages of KSM, since after COW, all the sptes pointing to the gfn
>     will be unmapped.
>
>   - in the case of soft mmu
>     - many spurious page faults due to the lazily flushed tlb
>     - lots of write-protect page faults (dirty bit tracking for guest
>       pte, shadow page table write-protected, frame-buffer under Xwindow,
>       migration, ...)
>
> * Implementation
> We can freely walk the shadow page table between
> walk_shadow_page_lockless_begin and walk_shadow_page_lockless_end, which
> ensures that all the shadow pages are valid.
>
> In most cases, cmpxchg is good enough to change the access bits of the
> spte, but the write-protect path on softmmu/nested mmu is a special case:
> it is a read-check-modify path: read spte, check W bit, then clear W bit.

We also set gpte.D and gpte.A, no? How do you handle that?

>  In order
> to avoid marking spte writable after/during page write-protect, we do the
> trick like below:
>
>       fast page fault path:
>             lock RCU
>             set identification in the spte

What if you can't (already taken)?  Spin?  Slow path?

>             smp_mb()
>             if (!rmap.PTE_LIST_WRITE_PROTECT)
>                  cmpxchg + w - vcpu-id
>             unlock RCU
>
>       write protect path:
>             lock mmu-lock
>             set rmap.PTE_LIST_WRITE_PROTECT
>                  smp_mb()
>             if (spte.w || spte has identification)
>                  clear w bit and identification
>             unlock mmu-lock
>
> Setting the identification in the spte notifies the page-protect path
> that it must modify the spte, so that the fast path sees the change in
> its cmpxchg.
>
> Setting the identification is also a trick: it only sets the last bit of
> the spte, which neither changes the mapping nor loses the cpu status bits.

There are plenty of available bits, 53-62.
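
For reference, the two quoted paths might be sketched with C11 atomics
as below. This is only an illustration under stated assumptions, not the
code from the patches: the struct and function names are hypothetical,
the identification is stored as vcpu_id + 1 in bits 53-60, the
seq_cst fence stands in for smp_mb(), RCU and mmu-lock are not modelled,
and falling back to the slow path when an identification is already
present is just one possible answer to the question above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define SPTE_PRESENT  (1ull << 0)
#define SPTE_WRITABLE (1ull << 1)
#define SPTE_ID_SHIFT 53                       /* assumed free bits 53-60 */
#define SPTE_ID_MASK  (0xffull << SPTE_ID_SHIFT)

static_assert((SPTE_ID_MASK & SPTE_WRITABLE) == 0, "id must not overlap W");

struct gfn_state {                 /* hypothetical: one spte plus rmap flag */
    _Atomic uint64_t spte;
    atomic_bool write_protected;   /* stands in for PTE_LIST_WRITE_PROTECT */
};

/* Fast page fault path; returns false when the slow path is needed. */
bool fast_pf_fix_write(struct gfn_state *g, unsigned vcpu_id)
{
    if (vcpu_id >= 255)
        return false;              /* id does not fit in the 8-bit field */
    uint64_t id = (uint64_t)(vcpu_id + 1) << SPTE_ID_SHIFT;
    uint64_t old = atomic_load(&g->spte);

    do {                           /* set identification in the spte */
        if (old & SPTE_ID_MASK)
            return false;          /* already taken: assume slow path */
    } while (!atomic_compare_exchange_weak(&g->spte, &old, old | id));

    uint64_t marked = old | id;
    atomic_thread_fence(memory_order_seq_cst);          /* smp_mb() */

    if (atomic_load(&g->write_protected)) {
        /* Page is write-protected: drop our mark if it is still there. */
        atomic_compare_exchange_strong(&g->spte, &marked,
                                       marked & ~SPTE_ID_MASK);
        return false;
    }
    /* cmpxchg + w - vcpu-id: fails if the write-protect path intervened. */
    return atomic_compare_exchange_strong(
        &g->spte, &marked, (marked & ~SPTE_ID_MASK) | SPTE_WRITABLE);
}

/* Write-protect path; the caller is assumed to hold mmu-lock. */
void write_protect(struct gfn_state *g)
{
    atomic_store(&g->write_protected, true);
    atomic_thread_fence(memory_order_seq_cst);          /* smp_mb() */

    uint64_t old = atomic_load(&g->spte);
    while ((old & (SPTE_WRITABLE | SPTE_ID_MASK)) &&
           !atomic_compare_exchange_weak(
               &g->spte, &old, old & ~(SPTE_WRITABLE | SPTE_ID_MASK)))
        ;                          /* retry with the reloaded value */
}
```

Each side publishes its own flag, issues a full barrier, and then checks
the other side's flag, so at least one of them should observe the other.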

>
> The identification should be unique to avoid the below race:
>
>      VCPU 0                VCPU 1            VCPU 2
>       lock RCU
>    spte + identification
>    check conditions
>                        do write-protect, clear
>                           identification
>                                               lock RCU
>                                         set identification
>      cmpxchg + w - identification
>         OOPS!!!

Is it not sufficient to use just two bits?

pf_lock - taken by page fault path
wp_lock - taken by write protect path

pf cmpxchg checks both bits.
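
One possible reading of the two-bit idea, sketched with C11 atomics. The
bit positions (53 and 54) and the function names are assumptions, the
fast path is split into two steps only so that the interleaving from the
diagram can be replayed, and the sketch does not model when wp_lock is
released, so it does not settle whether two bits are sufficient.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define SPTE_W       (1ull << 1)
#define SPTE_PF_LOCK (1ull << 53)  /* taken by the page fault path */
#define SPTE_WP_LOCK (1ull << 54)  /* taken by the write protect path */

static_assert((SPTE_PF_LOCK & SPTE_WP_LOCK) == 0, "locks are distinct bits");

/* Fast path, step 1: take pf_lock unless either lock bit is already set. */
bool pf_begin(_Atomic uint64_t *spte, uint64_t *snap)
{
    uint64_t old = atomic_load(spte);

    do {
        if (old & (SPTE_PF_LOCK | SPTE_WP_LOCK))
            return false;          /* assume the caller takes the slow path */
    } while (!atomic_compare_exchange_weak(spte, &old, old | SPTE_PF_LOCK));

    *snap = old | SPTE_PF_LOCK;    /* value the final cmpxchg must still see */
    return true;
}

/*
 * Fast path, step 2: the cmpxchg "checks both bits" because it only
 * succeeds if the spte still has pf_lock set and wp_lock clear.
 */
bool pf_commit(_Atomic uint64_t *spte, uint64_t snap)
{
    uint64_t want = (snap & ~SPTE_PF_LOCK) | SPTE_W;

    return atomic_compare_exchange_strong(spte, &snap, want);
}

/* Write protect path: take wp_lock, clear W and any pending pf_lock. */
void wp_protect(_Atomic uint64_t *spte)
{
    uint64_t old = atomic_load(spte);
    uint64_t next;

    do {
        next = (old & ~(SPTE_W | SPTE_PF_LOCK)) | SPTE_WP_LOCK;
    } while (!atomic_compare_exchange_weak(spte, &old, next));
}
```

Replaying the quoted race in order (VCPU 0 pf_begin, VCPU 1 wp_protect,
VCPU 2 pf_begin, VCPU 0 pf_commit), VCPU 2 is refused because wp_lock is
set, and the commit on VCPU 0 fails because its pf_lock is gone.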

> We choose the vcpu id as the unique value; currently, 254 vcpus on VMX
> and 127 vcpus on softmmu can be fast. Keep it simple at first. :)
>
>
> * Performance
> It introduces a full memory barrier on the page write-protect path. I
> ran kernbench in text mode, which does not generate frame-buffer
> write-protect page faults and so does not benefit from the optimization
> introduced by this patch; it shows no regression.
>
> And here are the results from x11perf and from migration under autotest:
>
> x11perf (x11perf -repeat 10 -comppixwin500):
> (Host: Intel(R) Core(TM) i5-2540M CPU @ 2.60GHz * 4 + 4G
>  Guest: 4 vcpus + 1G)
>
> - For ept:
> $ x11perfcomp baseline-hard optimaze-hard
> 1: baseline-hard
> 2: optimaze-hard
>
>      1         2    Operation
> --------  --------  ---------
>   7060.0    7150.0  Composite 500x500 from pixmap to window
>
> - For shadow mmu:
> $ x11perfcomp baseline-soft optimaze-soft
> 1: baseline-soft
> 2: optimaze-soft
>
>      1         2    Operation
> --------  --------  ---------
>   6980.0    7490.0  Composite 500x500 from pixmap to window
>
> ( Interestingly, after this patch x11perf performs better on softmmu
>   than on hardmmu. I have tested it many times and it really is true. :) )

It could be because you cannot use THP with dirty logging, so you pay
the overhead of TDP.
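
To put the quoted x11perf numbers in relative terms, the gain can be
computed as (optimized - baseline) / baseline. A small helper, with a
hypothetical name, assuming higher x11perf figures are better:

```c
#include <assert.h>

/* Relative improvement of `optimized` over `baseline`, in percent. */
double percent_gain(double baseline, double optimized)
{
    assert(baseline != 0.0);       /* a zero baseline has no relative gain */
    return (optimized - baseline) / baseline * 100.0;
}
```

With the figures above, percent_gain(7060.0, 7150.0) is about 1.3 for
ept and percent_gain(6980.0, 7490.0) is about 7.3 for the shadow mmu.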

> autotest migration:
> (Host: Intel(R) Xeon(R) CPU           X5690  @ 3.47GHz * 12 + 32G)
>
> - For ept:
>
> Before:
>                     smp2.Fedora.16.64.migrate
> Times   .unix      .with_autotest.dbench.unix     total
>  1       102           204                         309
>  2       68            203                         275
>  3       67            218                         289
>
> After:
>                     smp2.Fedora.16.64.migrate
> Times   .unix      .with_autotest.dbench.unix     total
>  1       103           189                         295
>  2       67            188                         259
>  3       64            202                         271
>
>
> - For shadow mmu:
>
> Before:
>                     smp2.Fedora.16.64.migrate
> Times   .unix      .with_autotest.dbench.unix     total
>  1       102           262                         368
>  2       68            220                         292
>  3       68            234                         307
>
> After:
>                     smp2.Fedora.16.64.migrate
> Times   .unix      .with_autotest.dbench.unix     total
>  1       104           231                         341
>  2       68            218                         289
>  3       66            205                         275
>
>
> Any comments are welcome. :)
>

Very impressive.  Now to review the patches (will take me some time).

-- 
error compiling committee.c: too many arguments to function

