From: Gleb Natapov
Subject: Re: [patch 4/5] KVM: MMU: reload request from GET_DIRTY_LOG path
Date: Tue, 1 Jul 2014 09:27:19 +0300
Message-ID: <20140701062719.GK18167@minantech.com>
References: <20140618231203.846608908@amt.cnet>
	<20140618231521.718959400@amt.cnet>
	<20140619081719.GD10948@minantech.com>
	<20140619184031.GC32410@amt.cnet>
	<20140620104610.GD20764@minantech.com>
	<20140630205902.GB26566@amt.cnet>
To: Marcelo Tosatti
Cc: kvm@vger.kernel.org, ak@linux.intel.com, pbonzini@redhat.com,
	xiaoguangrong@linux.vnet.ibm.com, avi@cloudius-systems.com
In-Reply-To: <20140630205902.GB26566@amt.cnet>

On Mon, Jun 30, 2014 at 05:59:02PM -0300, Marcelo Tosatti wrote:
> On Fri, Jun 20, 2014 at 01:46:10PM +0300, Gleb Natapov wrote:
> > On Thu, Jun 19, 2014 at 03:40:31PM -0300, Marcelo Tosatti wrote:
> > > On Thu, Jun 19, 2014 at 11:17:19AM +0300, Gleb Natapov wrote:
> > > > On Wed, Jun 18, 2014 at 08:12:07PM -0300, mtosatti@redhat.com wrote:
> > > > > Reload remote vcpus MMU from GET_DIRTY_LOG codepath, before
> > > > > deleting a pinned spte.
> > > > >
> > > > > Signed-off-by: Marcelo Tosatti
> > > > >
> > > > > ---
> > > > >  arch/x86/kvm/mmu.c |    3 +++
> > > > >  1 file changed, 3 insertions(+)
> > > > >
> > > > > Index: kvm.pinned-sptes/arch/x86/kvm/mmu.c
> > > > > ===================================================================
> > > > > --- kvm.pinned-sptes.orig/arch/x86/kvm/mmu.c	2014-06-13 16:50:50.040140594 -0300
> > > > > +++ kvm.pinned-sptes/arch/x86/kvm/mmu.c	2014-06-13 16:51:05.620104451 -0300
> > > > > @@ -1247,6 +1247,9 @@
> > > > >  	spte &= ~SPTE_MMU_WRITEABLE;
> > > > >  	spte = spte & ~PT_WRITABLE_MASK;
> > > > >
> > > > > +	if (is_pinned_spte(spte))
> > > > > +		mmu_reload_pinned_vcpus(kvm);
> > > > > +
> > > > Why write protect it at all? mmu_reload_pinned_vcpus() will unprotect it
> > > > anyway on the next vmentry. Isn't it better to just report all pinned
> > > > pages as dirty always?
> > >
> > > That was the initial plan, however it is awkward to stop vcpus, execute
> > > get_dirty_log twice, and have pages marked as dirty on the second
> > > execution.
> > Indeed, but I think it may happen today with vhost (or even other devices
> > that emulate DMA), so userspace should be able to deal with it already.
> >
> > > That is, it is "incorrect" to report pages as dirty when they are
> > > clean.
> > In the case of the PEBS area we cannot know. The CPU writes there directly
> > without KVM even knowing about it, so the only sane thing is to assume
> > that after a vmentry the PEBS area is dirty. This is what happens with
> > this patch BTW: mmu_reload_pinned_vcpus() will force ->page_fault(FERR_WRITE),
> > which will mark all pinned pages as dirty even if the pages are never
> > written to. You can achieve the same by having vcpu->pinned_page_dirty,
> > which is set to true on each vmentry and to false on each GET_DIRTY_LOG.
> >
> > > Moreover, if the number of pinned pages is larger than the dirty
> > > threshold to stop the VM and migrate, you'll never migrate. If vcpus are
> > > in HLT and don't VM-enter immediately, the pages should not be refaulted
> > > right away.
> > We should not allow that many pinned pages for security reasons. And
> > having a lot of pages to fault in on the next vmentry after each
> > GET_DIRTY_LOG will slow down the guest during migration considerably.
> >
> > > Do you think the optimization is worthwhile?
> >
> > I think it's simpler, but even if we go with your current approach it
> > should be improved: there is no point sending an IPI to all vcpus in
> > spte_write_protect() like you do here, since the IPI will be sent anyway
> > at the end of write protection because of the ptes' writable->nonwritable
> > transition. So all you have to do here is set KVM_REQ_MMU_RELOAD; there
> > is no need for an IPI.
> No, you have to make sure the vcpu is out of guest mode.
>
> > In fact this makes me think that write protecting a pinned page here is
> > incorrect, because the old translation may not be in the TLB, and if the
> > CPU tries to write a PEBS entry after the pte is write protected but
> > before the MMU is reloaded, it will encounter a non-writable pte and god
> > knows what will happen; the SDM does not tell. So spte_write_protect()
> > should be something like that, IMO:
> The vcpu will never see a read-only spte because the VM-exit (due to the
> IPI) guarantees the vcpu is outside of guest mode _before_ it is write
> protected.

Right. Now I see why you absolutely have to send the IPI in
mmu_reload_pinned_vcpus() before marking the pte as read-only. And
kvm->mmu_lock is what will prevent a vcpu from re-entering guest mode
again before the pte is marked read-only, right?

> So I ask you: do you still hold the "current approach should be
> improved" position?
>
As I said, IMO what I proposed is much simpler and not as tricky as what
you have here. It also has the advantage of not slowing down the next
guest entry after GET_DIRTY_LOG, because it does not require an MMU
reload and faulting in the pinned pages.

--
			Gleb.