From: Marcelo Tosatti <marcelo@kvack.org>
To: Andrea Arcangeli <andrea@qumranet.com>
Cc: Marcelo Tosatti <marcelo@kvack.org>,
kvm-devel <kvm-devel@lists.sourceforge.net>,
Avi Kivity <avi@qumranet.com>
Subject: Re: KVM: MMU: add KVM_ZAP_GFN ioctl
Date: Fri, 21 Mar 2008 18:23:41 -0300 [thread overview]
Message-ID: <20080321212341.GA3897@dmt> (raw)
In-Reply-To: <20080321155650.GG7822@v2.random>
On Fri, Mar 21, 2008 at 04:56:50PM +0100, Andrea Arcangeli wrote:
> On Fri, Mar 21, 2008 at 10:37:00AM -0300, Marcelo Tosatti wrote:
> > This is not the final put_page().
> >
> > Remote TLB's are flushed here, after rmap_remove:
> >
> > +	if (nuked)
> > +		kvm_flush_remote_tlbs(kvm);
> >
> > This ioctl is called before zap_page_range() is executed through
> > sys_madvise(MADV_DONTNEED) to remove the page in question.
> >
> > We know that the guest will not attempt to fault in the gfn because
> > the virtio balloon driver is synchronous (it will only attempt to
> > release that page back to the guest OS once rmap_nuke+zap_page_range has
> > finished).
> >
> > Can you be more verbose?
>
> Sure.
>
> 1) even if you run madvise(MADV_DONTNEED) after KVM_ZAP_GFN, the anon
> page can be released by the VM at any time without any kvm-aware
> lock (there will be a swap reference to it but no more page_count
> references, leading to memory corruption in the host in the presence
> of memory pressure). This is purely theoretical of course; not sure
> if timings or probabilities allow for reproducing this in real life.
If there are any active shadow mappings to a page, there is a guarantee
that a valid linux pte still points at it. So page_count ==
1 + nr_sptes.
So the theoretical race you're talking about is:
    CPU0                                    CPU1

    spte = rmap_next(kvm, rmapp, NULL);
    while (spte) {
            BUG_ON(!spte);
            BUG_ON(!(*spte & PT_PRESENT_MASK));
            rmap_printk("rmap_nuke: spte %p %llx\n", spte, *spte);
            rmap_remove(kvm, spte);
            set_shadow_pte(spte, shadow_trap_nonpresent_pte);
            nuked = 1;
            spte = rmap_next(kvm, rmapp, spte);
    }
                                            <---------- try_to_unmap_one()
                                            page is now free
                                            page allocated for other
                                            purposes
    if (nuked)
            kvm_flush_remote_tlbs(kvm);
And some other VCPU, with the translation still cached in its TLB,
writes to the now-freed (and possibly reallocated) page.
This case is safe because the path that frees a pte and subsequently
a page takes care of flushing the TLBs of any remote CPUs that may
have it cached (before freeing the page, of course):
ptep_clear_flush->flush_tlb_page.
Am I missing something?
> 2) not sure what you mean by synchronous, do you mean single
> threaded? I can't see how it can be single threaded (does
> ballooning stop all other vcpus?).
No, I mean synchronous as in that no other vcpu will attempt to fault
that _particular gfn_ in between KVM_ZAP_GFN and madvise.
> Why are you taking the mmu_lock around rmap_nuke if no other vcpu
> can take any page fault and call into get_user_pages in between
> KVM_ZAP_GFN and madvise?
Other vcpus can take page faults and call into get_user_pages, but not
for the gfn KVM_ZAP_GFN is operating on, because that page has been
allocated by the balloon driver.
So we need mmu_lock to protect against concurrent shadow page and rmap
operations.
> As far as I
> can tell the only possible safe ordering is madvise; KVM_ZAP_GFN,
> which is emulating the mmu notifier behavior incidentally.
>
> Note that the rmap_remove smp race (also note here smp race means
> smp-host race, it will trigger even if guest is UP) might be a generic
> issue with the rmap_remove logic. I didn't analyze all the possible
> rmap_remove callers yet (this was in my todo list), I just made sure
> that my code would be smp safe.
As detailed above, we have a guarantee that there is a live linux pte
by the time rmap_remove() nukes a shadow pte.
> > By the way, I don't see invalidate_begin/invalidate_end hooks in the KVM
> > part of MMU notifiers V9 patch? (meaning that zap_page_range will not zap
> > the spte's for the pages in question).
>
> range_begin isn't needed. range_begin is needed only by secondary mmu
> drivers that aren't reference counting the pages. The _end callback is
> below. It could be improved to skip the whole range in a single pass
> over the memslots instead of browsing them for each page in the range.
> The mmu notifiers aren't merged and this code may still require changes
> in terms of API if EMM is merged instead of #v9 (hope not), so I tried
> to keep it simple.
Oh, I missed that. Nice.
Thread overview: 10+ messages
2008-03-12 22:29 KVM: MMU: add KVM_ZAP_GFN ioctl Marcelo Tosatti
2008-03-20 12:09 ` Avi Kivity
2008-03-21 11:42 ` Andrea Arcangeli
2008-03-21 13:37 ` Marcelo Tosatti
2008-03-21 15:56 ` Andrea Arcangeli
2008-03-21 21:23 ` Marcelo Tosatti [this message]
2008-03-24 6:54 ` Andrea Arcangeli
2008-03-26 12:31 ` Andrea Arcangeli
2008-03-23 8:50 ` Avi Kivity
2008-03-23 20:23 ` Andrea Arcangeli