From: Marcelo Tosatti <mtosatti@redhat.com>
To: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: gleb@redhat.com, avi.kivity@gmail.com, pbonzini@redhat.com,
	linux-kernel@vger.kernel.org, kvm@vger.kernel.org
Subject: Re: [PATCH 04/12] KVM: MMU: log dirty page after marking spte writable
Date: Thu, 8 Aug 2013 12:06:59 -0300	[thread overview]
Message-ID: <20130808150659.GA23075@amt.cnet> (raw)
In-Reply-To: <5201C7D9.20004@linux.vnet.ibm.com>

On Wed, Aug 07, 2013 at 12:06:49PM +0800, Xiao Guangrong wrote:
> On 08/07/2013 09:48 AM, Marcelo Tosatti wrote:
> > On Tue, Jul 30, 2013 at 09:02:02PM +0800, Xiao Guangrong wrote:
> >> Make sure we can see the writable spte before the dirty bitmap is visible
> >>
> >> We do this because kvm_vm_ioctl_get_dirty_log() write-protects the spte based
> >> on the dirty bitmap; we should ensure the writable spte can be found in the
> >> rmap before the dirty bitmap is visible. Otherwise, we would clear the dirty
> >> bitmap but fail to write-protect the page.
> >>
> >> Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
> >> ---
> >>  arch/x86/kvm/mmu.c | 6 +++---
> >>  1 file changed, 3 insertions(+), 3 deletions(-)
> > 
> > Can you explain why this is safe, with regard to the rule 
> > at edde99ce05290e50 ?
> 
> BTW, this patch fixes this case:
> 
>      VCPU 0                KVM migration control
> 
>                            write-protects all pages
> #PF happens, then the page
> becomes writable; set the dirty
> bit in the bitmap
> 
>                           swap the bitmap, current bitmap is empty
> 
> write the page (no dirty log)
> 
>                           stop the guest and push
>                           the remaining dirty pages
> Stopped
>                           Sees the current bitmap is empty, which means
>                           no page is dirty.
> > 
> > "The rule is that all pages are either dirty in the current bitmap,
> > or write-protected, which is violated here."
> 
> Actually, this rule is not completely true; there is a third case:
> the window between writing the guest page and setting the dirty bitmap is valid.
> In that window, the page is writable but not dirty-logged.
> 
> This case is based on the fact that at the final step of live migration,
> kvm should stop the guest and push the remaining dirty pages to the
> destination.
> 
> There are some examples in the current code:
> example 1, in fast_pf_fix_direct_spte():
> 	if (cmpxchg64(sptep, spte, spte | PT_WRITABLE_MASK) == spte)
> 		/* The window in here... */
> 		mark_page_dirty(vcpu->kvm, gfn);
> 
> example 2, in kvm_write_guest_page():
> 	r = __copy_to_user((void __user *)addr + offset, data, len);
> 	if (r)
> 		return -EFAULT;
> 	/*
> 	 * The window is here: the page is dirty but not logged in
> 	 * the bitmap.
> 	 */
> 	mark_page_dirty(kvm, gfn);
> 	return 0;
> 
> > 
> > Overall, please document what is the required order of operations for
> > both set_spte and get_dirty_log and why this order is safe (right on top
> > of the code).
> 
> Okay.
> 
> The order we require here is: we should 1) make the spte writable __before__
> setting the dirty bitmap, and 2) add the spte into the rmap __before__ setting
> the dirty bitmap.
> 
> Point 1) is the same as fast_pf_fix_direct_spte(), which I explained above.
> Points 1) and 2) ensure we can find the spte in the rmap and see that it is
> writable when we do the lockless write-protection; otherwise these cases happen:
> 
> VCPU 0					kvm ioctl doing get-dirty-pages
> 
> mark_page_dirty(gfn), which
> sets the gfn in the dirty bitmap
>                                       mask = xchg(dirty_bitmap, 0)
> 
>                                       walk all gfns set in "mask" and
>                                       locklessly write-protect each gfn,
>                                       then walk the rmap, see no spte in that rmap
> 	
> 
> add the spte into rmap
> 
> !!!!!! Then the page can be freely written but not recorded in the dirty bitmap.
> 
> Or:
> 
> VCPU 0					kvm ioctl doing get-dirty-pages
> 
> mark_page_dirty(gfn), which
> sets the gfn in the dirty bitmap
> 
> add spte into rmap
>                                       mask = xchg(dirty_bitmap, 0)
> 
>                                       walk all gfns set in "mask" and
>                                       locklessly write-protect each gfn,
>                                       then walk the rmap, see the spte is in the rmap
>                                       but it is read-only or non-present.
> 	
> Mark spte writable
> 
> !!!!!! Then the page can be freely written but not recorded in the dirty bitmap.
> 
> Hopefully, I have clarified it; if you have any questions, please let me know.

Yes, partially. On top of the explanation above, please have something along
the lines of the

"The flush IPI assumes that a thread switch happens in this order"
comment at arch/x86/mm/tlb.c

"With get_user_pages_fast, we walk down the pagetables without taking"
comment at arch/x86/mm/gup.c


What about the relation between read-only spte and respective TLB flush?
That is:

vcpu0					vcpu1

lockless write protect
mark spte read-only
					either some mmu-lockless path or not
					write protect page:
					find read-only page
					assumes tlb is flushed
				
kvm_flush_remote_tlbs


In general, I think write protection is a good candidate for lockless operation.
However, I would also be concerned about the larger problem of mmu-lock
contention and how to solve it. Did you ever consider splitting it?

> >> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> >> index 35d4b50..0fe56ad 100644
> >> --- a/arch/x86/kvm/mmu.c
> >> +++ b/arch/x86/kvm/mmu.c
> >> @@ -2486,12 +2486,12 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
> >>  		}
> >>  	}
> >>  
> >> -	if (pte_access & ACC_WRITE_MASK)
> >> -		mark_page_dirty(vcpu->kvm, gfn);
> >> -
> >>  set_pte:
> >>  	if (mmu_spte_update(sptep, spte))
> >>  		kvm_flush_remote_tlbs(vcpu->kvm);
> > 
> > Here, there is a modified guest page without dirty log bit set (think
> > another vcpu writing to the page via this spte).
> 
> This is okay since we call mark_page_dirty(vcpu->kvm, gfn) after that; this
> is the same case as fast_pf_fix_direct_spte(), and I have explained why it is
> safe above.

Right. 
