All of lore.kernel.org
 help / color / mirror / Atom feed
From: Marcelo Tosatti <mtosatti@redhat.com>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>,
	avi@redhat.com, kvm@vger.kernel.org,
	xiaoguangrong@linux.vnet.ibm.com
Subject: Re: [PATCH 1/2] KVM: mmu_notifier: Flush TLBs before releasing mmu_lock
Date: Tue, 14 Feb 2012 15:29:47 -0200	[thread overview]
Message-ID: <20120214172947.GA22362@amt.cnet> (raw)
In-Reply-To: <20120214171044.GK9440@redhat.com>

On Tue, Feb 14, 2012 at 06:10:44PM +0100, Andrea Arcangeli wrote:
> On Fri, Feb 10, 2012 at 03:28:31PM +0900, Takuya Yoshikawa wrote:
> > Other threads may process the same page in that small window and skip
> > TLB flush and then return before these functions do flush.
> 
> It's correct to flush the shadow MMU TLB without the mmu_lock only in
> the context of mmu notifier methods. So the below while won't hurt,
> it's performance regression and shouldn't be applied (and
> it obfuscates the code by not being strict anymore).
> 
> To the contrary every other place that does a shadow/secondary MMU smp
> tlb flush _must_ happen inside the mmu_lock, otherwise the
> serialization isn't correct anymore against the very below mmu_lock in
> the below quoted patch taken by
> kvm_mmu_notifier_invalidate_page/kvm_mmu_notifier_invalidate_range_start.
> 
> The explanation is in commit 4539b35881ae9664b0e2953438dd83f5ee02c0b4.
> 
> I'll try to explain it more clearly: the moment you drop mmu_lock,
> pages can be freed. So if you invalidate a spte in any place inside
> the KVM code (except the mmu notifier methods where a reference of the
> page is implicitly hold by the caller and so the page can't go away
> under a mmu notifier method by design), then the below
> kvm_mmu_notifier_invalidate_page/kvm_mmu_notifier_invalidate_range_start
> won't get their need_tlb_flush set anymore, and they won't run the tlb
> flush before freeing the page.
> 
> So every other place (except mmu notifier) must flush the secondary
> MMU smp tlb _before_ releasing the mmu_lock.
> 
> Only mmu notifier is safe to flush the secondary MMU TLB _after_
> releasing the mmu_lock.
> 
> If we changed the mmu notifier methods to unconditionally flush the
> shadow TLB (regardless if a spte was present or not), we might not
> need to hold the mmu_lock in every tlb flush outside the context of
> the mmu notifier methods. But then the mmu notifier methods would
> become more expensive, I didn't evaluate fully what would be the side
> effects of such a change. Also note, only the
> kvm_mmu_notifier_invalidate_page and
> kvm_mmu_notifier_invalidate_range_start would need to do that, because
> they're the only two where the page reference gets dropped.
> 
> Even shorter: because the mmu notifier a implicit reference on the
> page exists and is hold by the caller, they can flush outside the
> mmu_lock. Every other place in KVM only holds an implicit valid
> reference on the page only as long as you hold the mmu_lock, or while
> a spte is still established.
> 
> Well it's not easy logic so it's not surprising it wasn't totally
> clear.
> 
> It's probably not heavily documented, and the fact you changed it
> still is still good so we refresh our minds on the exact rules of mmu
> notifier locking, thanks!

The problem the patch is fixing is not related to page freeing, but
rmap_write_protect. From 8bf3f6f06fcdfd097b6c6ec51531d8292fa0d81d
(replace "A (get_dirty_log)" with "mmu_notifier_invalidate_page"):


During protecting pages for dirty logging, other threads may also try
to protect a page in mmu_sync_children() or kvm_mmu_get_page().

In such a case, because get_dirty_log releases mmu_lock before flushing
TLB's, the following race condition can happen:

  A (get_dirty_log)     B (another thread)

  lock(mmu_lock)
  clear pte.w
  unlock(mmu_lock)
                        lock(mmu_lock)
                        pte.w is already cleared
                        unlock(mmu_lock)
                        skip TLB flush
                        return
  ...
  TLB flush

Though thread B assumes the page has already been protected when it
returns, the remaining TLB entry will break that assumption.


> 
> Andrea
> 
> > 
> > Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
> > ---
> >  virt/kvm/kvm_main.c |   19 ++++++++++---------
> >  1 files changed, 10 insertions(+), 9 deletions(-)
> > 
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 470e305..2b4bc77 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -289,15 +289,15 @@ static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
> >  	 */
> >  	idx = srcu_read_lock(&kvm->srcu);
> >  	spin_lock(&kvm->mmu_lock);
> > +
> >  	kvm->mmu_notifier_seq++;
> >  	need_tlb_flush = kvm_unmap_hva(kvm, address) | kvm->tlbs_dirty;
> > -	spin_unlock(&kvm->mmu_lock);
> > -	srcu_read_unlock(&kvm->srcu, idx);
> > -
> >  	/* we've to flush the tlb before the pages can be freed */
> >  	if (need_tlb_flush)
> >  		kvm_flush_remote_tlbs(kvm);
> >  
> > +	spin_unlock(&kvm->mmu_lock);
> > +	srcu_read_unlock(&kvm->srcu, idx);
> >  }
> >  
> >  static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
> > @@ -335,12 +335,12 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> >  	for (; start < end; start += PAGE_SIZE)
> >  		need_tlb_flush |= kvm_unmap_hva(kvm, start);
> >  	need_tlb_flush |= kvm->tlbs_dirty;
> > -	spin_unlock(&kvm->mmu_lock);
> > -	srcu_read_unlock(&kvm->srcu, idx);
> > -
> >  	/* we've to flush the tlb before the pages can be freed */
> >  	if (need_tlb_flush)
> >  		kvm_flush_remote_tlbs(kvm);
> > +
> > +	spin_unlock(&kvm->mmu_lock);
> > +	srcu_read_unlock(&kvm->srcu, idx);
> >  }
> >  
> >  static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
> > @@ -378,13 +378,14 @@ static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
> >  
> >  	idx = srcu_read_lock(&kvm->srcu);
> >  	spin_lock(&kvm->mmu_lock);
> > -	young = kvm_age_hva(kvm, address);
> > -	spin_unlock(&kvm->mmu_lock);
> > -	srcu_read_unlock(&kvm->srcu, idx);
> >  
> > +	young = kvm_age_hva(kvm, address);
> >  	if (young)
> >  		kvm_flush_remote_tlbs(kvm);
> >  
> > +	spin_unlock(&kvm->mmu_lock);
> > +	srcu_read_unlock(&kvm->srcu, idx);
> > +
> >  	return young;
> >  }
> >  
> > -- 
> > 1.7.5.4
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe kvm" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

  reply	other threads:[~2012-02-14 17:32 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-02-10  6:28 [PATCH 1/2] KVM: mmu_notifier: Flush TLBs before releasing mmu_lock Takuya Yoshikawa
2012-02-10  6:29 ` [PATCH 2/2] KVM: MMU: Flush TLBs only once in invlpg() " Takuya Yoshikawa
2012-02-10  6:55   ` Xiao Guangrong
2012-02-10  7:21     ` Takuya Yoshikawa
2012-02-10  7:42       ` Xiao Guangrong
2012-02-14  4:36         ` Takuya Yoshikawa
2012-02-14  4:56           ` Takuya Yoshikawa
2012-02-14 17:21             ` Andrea Arcangeli
2012-02-10  7:52 ` [PATCH 1/2] KVM: mmu_notifier: Flush TLBs " Xiao Guangrong
2012-02-13  6:00   ` Takuya Yoshikawa
2012-02-14 17:27   ` Andrea Arcangeli
2012-02-10 17:26 ` Marcelo Tosatti
2012-02-14 17:10 ` Andrea Arcangeli
2012-02-14 17:29   ` Marcelo Tosatti [this message]
2012-02-14 18:53     ` Andrea Arcangeli
2012-02-14 19:43       ` Marcelo Tosatti
2012-02-15  9:18         ` Avi Kivity
2012-02-15  9:47           ` Avi Kivity
2012-02-15 11:37             ` Xiao Guangrong
2012-02-15 14:07               ` Avi Kivity
2012-02-15 19:16                 ` Andrea Arcangeli
2012-02-16  4:50                 ` Xiao Guangrong
2012-02-16 11:57                   ` Avi Kivity
2012-02-17  2:36                     ` Xiao Guangrong

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20120214172947.GA22362@amt.cnet \
    --to=mtosatti@redhat.com \
    --cc=aarcange@redhat.com \
    --cc=avi@redhat.com \
    --cc=kvm@vger.kernel.org \
    --cc=xiaoguangrong@linux.vnet.ibm.com \
    --cc=yoshikawa.takuya@oss.ntt.co.jp \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.