Re: [RFC PATCH 00/12] KVM: MMU: locklessly wirte-protect

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Takuya Yoshikawa <takuya.yoshikawa@gmail.com>
To: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: gleb@redhat.com, avi.kivity@gmail.com, mtosatti@redhat.com,
	pbonzini@redhat.com, linux-kernel@vger.kernel.org,
	kvm@vger.kernel.org
Subject: Re: [RFC PATCH 00/12] KVM: MMU: locklessly wirte-protect
Date: Sat, 3 Aug 2013 14:09:43 +0900	[thread overview]
Message-ID: <20130803140943.3e09bbcda47693789df06315@gmail.com> (raw)
In-Reply-To: <1375189330-24066-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com>

On Tue, 30 Jul 2013 21:01:58 +0800
Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> wrote:

> Background
> ==========
> Currently, when mark memslot dirty logged or get dirty page, we need to
> write-protect large guest memory, it is the heavy work, especially, we need to
> hold mmu-lock which is also required by vcpu to fix its page table fault and
> mmu-notifier when host page is being changed. In the extreme cpu / memory used
> guest, it becomes a scalability issue.
> 
> This patchset introduces a way to locklessly write-protect guest memory.

Nice improvements!

If I read the patch set correctly, this work contains the following changes:

Cleanups:
        Patch 1 and patch 12.

Lazy large page dropping for dirty logging:
        Patch 2-3.
        Patch 2 is preparatory to patch 3.

        This does not look like an RFC if you address Marcelo's comment.
        Any reason to include this in an RFC patch set?

Making remote TLBs flushable outside of mmu_lock for dirty logging:
        Patch 6.

        This is nice.  I'm locally using a similar patch for my work, but yours
        is much cleaner and better.  I hope this will get merged soon.

New Pte-list handling:
        Patch 7-9.

        Still reading the details.

RCU-based lockless write protection.
        Patch 10-11.

        If I understand RCU correctly, the current implementation has a problem:
        read-side critical sections can become too long.

        See the following LWN's article:
        "Sleepable RCU"
        https://lwn.net/Articles/202847/

        Especially, kvm_mmu_slot_remove_write_access() can take hundreds of
        milliseconds, or even a few seconds for guests using shadow paging.
        Is it possible to break the read-side critical section after protecting
        some pages? -- I guess so.

Anyway, I want to see the following non-RFC quality patches get merged first:
        - Lazy large page dropping for dirty logging:
        - Making remote TLBs flushable outside of mmu_lock for dirty logging

As you are doing in patch 11, the latter can eliminate the TLB flushes before
cond_resched_lock().  So this alone is an optimization, and since my work is
based on this TLB flush-less lock breaking, I would appriciate if you make this
change first in your clean way.

The remaining patches, pte-list refactoring and lock-less ones, also look
interesting, but I need to read more to understand them.

Thanks for the nice work!
        Takuya


> 
> Idea
> ==========
> There are the challenges we meet and the ideas to resolve them.
> 
> 1) How to locklessly walk rmap?
> The first idea we got to prevent "desc" being freed when we are walking the
> rmap is using RCU. But when vcpu runs on shadow page mode or nested mmu mode,
> it updates the rmap really frequently.
> 
> So we uses SLAB_DESTROY_BY_RCU to manage "desc" instead, it allows the object
> to be reused more quickly. We also store a "nulls" in the last "desc"
> (desc->more) which can help us to detect whether the "desc" is moved to anther
> rmap then we can re-walk the rmap if that happened. I learned this idea from
> nulls-list.
> 
> Another issue is, when a spte is deleted from the "desc", another spte in the
> last "desc" will be moved to this position to replace the deleted one. If the
> deleted one has been accessed and we do not access the replaced one, the
> replaced one is missed when we do lockless walk.
> To fix this case, we do not backward move the spte, instead, we forward move
> the entry: when a spte is deleted, we move the entry in the first desc to that
> position.
> 
> 2) How to locklessly access shadow page table?
> It is easy if the handler is in the vcpu context, in that case we can use
> walk_shadow_page_lockless_begin() and walk_shadow_page_lockless_end() that
> disable interrupt to stop shadow page be freed. But we are on the ioctl context
> and the paths we are optimizing for have heavy workload, disabling interrupt is
> not good for the system performance.
> 
> We add a indicator into kvm struct (kvm->arch.rcu_free_shadow_page), then use
> call_rcu() to free the shadow page if that indicator is set. Set/Clear the
> indicator are protected by slot-lock, so it need not be atomic and does not
> hurt the performance and the scalability.
> 
> 3) How to locklessly write-protect guest memory?
> Currently, there are two behaviors when we write-protect guest memory, one is
> clearing the Writable bit on spte and the another one is dropping spte when it
> points to large page. The former is easy we only need to atomicly clear a bit
> but the latter is hard since we need to remove the spte from rmap. so we unify
> these two behaviors that only make the spte readonly. Making large spte
> readonly instead of nonpresent is also good for reducing jitter.
> 
> And we need to pay more attention on the order of making spte writable, adding
> spte into rmap and setting the corresponding bit on dirty bitmap since
> kvm_vm_ioctl_get_dirty_log() write-protects the spte based on the dirty bitmap,
> we should ensure the writable spte can be found in rmap before the dirty bitmap
> is visible. Otherwise, we cleared the dirty bitmap and failed to write-protect
> the page.
> 
> Performance result
> ====================
> Host: CPU: Intel(R) Xeon(R) CPU           X5690  @ 3.47GHz x 12
> Mem: 36G
> 
> The benchmark i used and will be attached:
> a) kernbench
> b) migrate-perf
>    it emulates guest migration
> c) mmtest
>    it repeatedly writes the memory and measures the time and is used to
>    generate memory access in the guest which is being migrated
> d) Qemu monitor command to implement guest live migration
>    the script can be found in migrate-perf.
>   
> 
> 1) First, we use kernbench to benchmark the performance with non-write-protection
>   case to detect the possible regression:
> 
>   EPT enabled:  Base: 84.05      After the patch: 83.53
>   EPT disabled: Base: 142.57     After the patch: 141.70
> 
>   No regression and the optimization may come from lazily drop large spte.
> 
> 2) Benchmark the performance of get dirty page
>    (./migrate-perf -c 12 -m 3000 -t 20)
> 
>    Base: Run 20 times, Avg time:24813809 ns.
>    After the patch: Run 20 times, Avg time:8371577 ns.
>    
>    It improves +196%
>   
> 3) There is the result of Live Migration:
>    3.1) Less vcpus, less memory and less dirty page generated
>         (
>           Guest config: MEM_SIZE=7G        VCPU_NUM=6
>           The workload in migrated guest:
>           ssh -f $CLIENT "cd ~; rm -f result; nohup /home/eric/mmtest/mmtest -m 3000 -c 30 -t 60 > result &"
>         )
> 
>                Live Migration time (ms)   Benchmark (ns)
> ----------------------------------------+-------------+---------+
> EPT    | Baseline |     21638           |  266601028            |
>        + -------------------------------+-------------+---------+
>        |   After  |     21110    +2.5%  |  264966696    +0.6%   |
> ----------------------------------------+-------------+---------+
> Shadow | Baseline |     22542           |  271969284  |         |
>        +----------+---------------------+-------------+---------+
>        |  After   |     21641    +4.1%  |  270485511    +0.5%   |
> -------+----------+---------------------------------------------+
> 
>    3.2) More vcpus, more memory and less dirty page generated
>        (
>          Guest config: MEM_SIZE=25G VCPU_NUM=12
>          The workload in migrated guest:
>          ssh -f $CLIENT "cd ~; rm -f result; nohup /home/eric/mmtest/mmtest -m 15000 -c 30 -t 30 > result &"
>        )
> 
>                Live Migration time (ms)   Benchmark (ns)
> ----------------------------------------+-------------+---------+
> EPT    | Baseline |     72773           |  1278228350           |
>        + -------------------------------+-------------+---------+
>        |   After  |     70516     +3.2% |  1266581587   +0.9%   |
> ----------------------------------------+-------------+---------+
> Shadow | Baseline |     74198           |  1323180090 |         |
>        +----------+---------------------+-------------+---------+
>        |  After   |     64948   +14.2%  |  1299283302   +1.8%  |
> -------+----------+---------------------------------------------+
> 
>    3.3) Less vcpus, more memory and huge dirty page generated
>         ( 
>           Guest config: MEM_SIZE=25G VCPU_NUM=6
>           The workload in migrated guest:
>           ssh -f $CLIENT "cd ~; rm -f result; nohup /home/eric/mmtest/mmtest -m 15000 -c 30 -t 200 > result &"
>         )
> 
>                Live Migration time (ms)   Benchmark (ns)
> ----------------------------------------+-------------+---------+
> EPT    | Baseline |     267473          |  1224657502           |
>        + -------------------------------+-------------+---------+
>        |   After  |     267374   +0.03% |  1221520513   +0.6%   |
> ----------------------------------------+-------------+---------+
> Shadow | Baseline |     369999          |  1712004428 |         |
>        +----------+---------------------+-------------+---------+
>        |  After   |     335737   +10.2% |  1556065063   +10.2%  |
> -------+----------+---------------------------------------------+
> 
>    For the case of 3.3), EPT gets small benefit, the reason is only the first
>    time guest writes memory need take mmu-lock to mark spte from nonpresent to
>    present. Other writes cost lots of time to trigger the page fault due to
>    write-protection which are fixed by fast page fault which need not take
>    mmu-lock.
> 
> Xiao Guangrong (12):
>   KVM: MMU: remove unused parameter
>   KVM: MMU: properly check last spte in fast_page_fault()
>   KVM: MMU: lazily drop large spte
>   KVM: MMU: log dirty page after marking spte writable
>   KVM: MMU: add spte into rmap before logging dirty page
>   KVM: MMU: flush tlb if the spte can be locklessly modified
>   KVM: MMU: redesign the algorithm of pte_list
>   KVM: MMU: introduce nulls desc
>   KVM: MMU: introduce pte-list lockless walker
>   KVM: MMU: allow locklessly access shadow page table out of vcpu thread
>   KVM: MMU: locklessly write-protect the page
>   KVM: MMU: clean up spte_write_protect
> 
>  arch/x86/include/asm/kvm_host.h |  10 +-
>  arch/x86/kvm/mmu.c              | 442 ++++++++++++++++++++++++++++------------
>  arch/x86/kvm/mmu.h              |  28 +++
>  arch/x86/kvm/x86.c              |  19 +-
>  4 files changed, 356 insertions(+), 143 deletions(-)
> 
> -- 
> 1.8.1.4
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


-- 
Takuya Yoshikawa <takuya.yoshikawa@gmail.com>

next prev parent reply	other threads:[~2013-08-03  5:09 UTC|newest]

Thread overview: 69+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-07-30 13:01 [RFC PATCH 00/12] KVM: MMU: locklessly wirte-protect Xiao Guangrong
2013-07-30 13:01 ` [PATCH 01/12] KVM: MMU: remove unused parameter Xiao Guangrong
2013-08-29  7:22   ` Gleb Natapov
2013-07-30 13:02 ` [PATCH 02/12] KVM: MMU: properly check last spte in fast_page_fault() Xiao Guangrong
2013-07-30 13:02 ` [PATCH 03/12] KVM: MMU: lazily drop large spte Xiao Guangrong
2013-08-02 14:55   ` Marcelo Tosatti
2013-08-02 15:42     ` Xiao Guangrong
2013-08-02 20:27       ` Marcelo Tosatti
2013-08-02 22:56         ` Xiao Guangrong
2013-07-30 13:02 ` [PATCH 04/12] KVM: MMU: log dirty page after marking spte writable Xiao Guangrong
2013-07-30 13:26   ` Paolo Bonzini
2013-07-31  7:25     ` Xiao Guangrong
2013-08-07  1:48   ` Marcelo Tosatti
2013-08-07  4:06     ` Xiao Guangrong
2013-08-08 15:06       ` Marcelo Tosatti
2013-08-08 16:26         ` Xiao Guangrong
2013-11-20  0:29       ` Marcelo Tosatti
2013-11-20  0:35         ` Marcelo Tosatti
2013-11-20 14:20         ` Xiao Guangrong
2013-11-20 19:47           ` Marcelo Tosatti
2013-11-21  4:26             ` Xiao Guangrong
2013-07-30 13:02 ` [PATCH 05/12] KVM: MMU: add spte into rmap before logging dirty page Xiao Guangrong
2013-07-30 13:27   ` Paolo Bonzini
2013-07-31  7:33     ` Xiao Guangrong
2013-07-30 13:02 ` [PATCH 06/12] KVM: MMU: flush tlb if the spte can be locklessly modified Xiao Guangrong
2013-08-28  7:23   ` Gleb Natapov
2013-08-28  7:50     ` Xiao Guangrong
2013-07-30 13:02 ` [PATCH 07/12] KVM: MMU: redesign the algorithm of pte_list Xiao Guangrong
2013-08-28  8:12   ` Gleb Natapov
2013-08-28  8:37     ` Xiao Guangrong
2013-08-28  8:58       ` Gleb Natapov
2013-08-28  9:19         ` Xiao Guangrong
2013-07-30 13:02 ` [PATCH 08/12] KVM: MMU: introduce nulls desc Xiao Guangrong
2013-08-28  8:40   ` Gleb Natapov
2013-08-28  8:54     ` Xiao Guangrong
2013-07-30 13:02 ` [PATCH 09/12] KVM: MMU: introduce pte-list lockless walker Xiao Guangrong
2013-08-28  9:20   ` Gleb Natapov
2013-08-28  9:33     ` Xiao Guangrong
2013-08-28  9:46       ` Gleb Natapov
2013-08-28 10:13         ` Xiao Guangrong
2013-08-28 10:49           ` Gleb Natapov
2013-08-28 12:15             ` Xiao Guangrong
2013-08-28 13:36               ` Gleb Natapov
2013-08-29  6:50                 ` Xiao Guangrong
2013-08-29  9:08                   ` Gleb Natapov
2013-08-29  9:31                     ` Xiao Guangrong
2013-08-29  9:51                       ` Gleb Natapov
2013-08-29 11:26                         ` Xiao Guangrong
2013-08-30 11:38                           ` Gleb Natapov
2013-09-02  7:02                             ` Xiao Guangrong
2013-08-29  9:31                   ` Gleb Natapov
2013-08-29 11:33                     ` Xiao Guangrong
2013-08-29 12:02                       ` Xiao Guangrong
2013-08-30 11:44                         ` Gleb Natapov
2013-09-02  8:50                           ` Xiao Guangrong
2013-07-30 13:02 ` [PATCH 10/12] KVM: MMU: allow locklessly access shadow page table out of vcpu thread Xiao Guangrong
2013-08-07 13:09   ` Takuya Yoshikawa
2013-08-07 13:19     ` Xiao Guangrong
2013-08-29  9:10   ` Gleb Natapov
2013-08-29  9:25     ` Xiao Guangrong
2013-07-30 13:02 ` [PATCH 11/12] KVM: MMU: locklessly write-protect the page Xiao Guangrong
2013-07-30 13:02 ` [PATCH 12/12] KVM: MMU: clean up spte_write_protect Xiao Guangrong
2013-07-30 13:11 ` [RFC PATCH 00/12] KVM: MMU: locklessly wirte-protect Xiao Guangrong
2013-08-03  5:09 ` Takuya Yoshikawa [this message]
2013-08-04 14:15   ` Xiao Guangrong
2013-08-29  7:16   ` Gleb Natapov
2013-08-06 13:16 ` Xiao Guangrong
2013-08-08 17:38   ` Paolo Bonzini
2013-08-09  4:51     ` Xiao Guangrong

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20130803140943.3e09bbcda47693789df06315@gmail.com \
    --to=takuya.yoshikawa@gmail.com \
    --cc=avi.kivity@gmail.com \
    --cc=gleb@redhat.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mtosatti@redhat.com \
    --cc=pbonzini@redhat.com \
    --cc=xiaoguangrong@linux.vnet.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).