* KVM swapping with mmu notifiers #v5
@ 2008-01-31 17:30 Andrea Arcangeli
From: Andrea Arcangeli @ 2008-01-31 17:30 UTC (permalink / raw)
To: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f; +Cc: Christoph Lameter
Hi,
The usual patch, but adapted to mmu notifier #v5; it works fine here as
expected.
I doubt Christoph's V4 was close to final yet: GRU wasn't covered at
all, and not even mremap was covered (nor XPMEM nor GRU) in V4.
The first workable API for XPMEM (to close the SMP race I have
explained since export-notifiers #v1) is just an idea from last
night... and for the first time I think XPMEM may work safely.
I think my #v5 is small enough, should already fit KVM and GRU, and
provides an API that allows optimization and extension over time; it
can then be extended to support XPMEM once that works in practice. I
really think it's a better idea to be able to test at least some code
before pushing a broad VM-visible API into mainline. This is what I
did with KVM, in fact: only once KVM was solidly swapping 3G over 1G
of RAM did I push the mmu notifiers to lkml.
Making the merge of KVM/GRU dependent on XPMEM support being merged
doesn't sound like a good idea. My patch also adds no overhead with
MMU_NOTIFIER=n. I hope Christoph agrees with my proposal to use #v5
as the mmu core and to merge it into mainline with higher priority, to
mostly close the discussions on KVM and GRU (optimizations remain
possible), and to keep working incrementally on XPMEM, pushing it into
mainline once you have verified that it doesn't crash at runtime and
that you don't need yet another change of API.
Signed-off-by: Andrea Arcangeli <andrea-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 4086080..c527d7d 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -18,6 +18,7 @@ config KVM
 	tristate "Kernel-based Virtual Machine (KVM) support"
 	depends on ARCH_SUPPORTS_KVM && EXPERIMENTAL
 	select PREEMPT_NOTIFIERS
+	select MMU_NOTIFIER
 	select ANON_INODES
 	---help---
 	  Support hosting fully virtualized guest machines using hardware
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index c85b904..adb20de 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -532,6 +532,110 @@ static void rmap_write_protect(struct kvm *kvm, u64 gfn)
 		kvm_flush_remote_tlbs(kvm);
 }
 
+static void kvm_unmap_spte(struct kvm *kvm, u64 *spte)
+{
+	struct page *page = pfn_to_page((*spte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT);
+	get_page(page);
+	rmap_remove(kvm, spte);
+	set_shadow_pte(spte, shadow_trap_nonpresent_pte);
+	kvm_flush_remote_tlbs(kvm);
+	__free_page(page);
+}
+
+static void kvm_unmap_rmapp(struct kvm *kvm, unsigned long *rmapp)
+{
+	u64 *spte, *curr_spte;
+
+	spte = rmap_next(kvm, rmapp, NULL);
+	while (spte) {
+		BUG_ON(!(*spte & PT_PRESENT_MASK));
+		rmap_printk("kvm_rmap_unmap_hva: spte %p %llx\n", spte, *spte);
+		curr_spte = spte;
+		spte = rmap_next(kvm, rmapp, spte);
+		kvm_unmap_spte(kvm, curr_spte);
+	}
+}
+
+void kvm_unmap_hva(struct kvm *kvm, unsigned long hva)
+{
+	int i;
+
+	/*
+	 * If mmap_sem isn't taken, we can look up the memslots with only
+	 * the mmu_lock by skipping over the slots with userspace_addr == 0.
+	 */
+	spin_lock(&kvm->mmu_lock);
+	for (i = 0; i < kvm->nmemslots; i++) {
+		struct kvm_memory_slot *memslot = &kvm->memslots[i];
+		unsigned long start = memslot->userspace_addr;
+		unsigned long end;
+
+		/* mmu_lock protects userspace_addr */
+		if (!start)
+			continue;
+
+		end = start + (memslot->npages << PAGE_SHIFT);
+		if (hva >= start && hva < end) {
+			gfn_t gfn_offset = (hva - start) >> PAGE_SHIFT;
+			kvm_unmap_rmapp(kvm, &memslot->rmap[gfn_offset]);
+		}
+	}
+	spin_unlock(&kvm->mmu_lock);
+}
+
+static int kvm_age_rmapp(struct kvm *kvm, unsigned long *rmapp)
+{
+	u64 *spte;
+	int young = 0;
+
+	spte = rmap_next(kvm, rmapp, NULL);
+	while (spte) {
+		int _young;
+		u64 _spte = *spte;
+		BUG_ON(!(_spte & PT_PRESENT_MASK));
+		_young = _spte & PT_ACCESSED_MASK;
+		if (_young) {
+			young = !!_young;
+			set_shadow_pte(spte, _spte & ~PT_ACCESSED_MASK);
+		}
+		spte = rmap_next(kvm, rmapp, spte);
+	}
+	return young;
+}
+
+int kvm_age_hva(struct kvm *kvm, unsigned long hva)
+{
+	int i;
+	int young = 0;
+
+	/*
+	 * If mmap_sem isn't taken, we can look up the memslots with only
+	 * the mmu_lock by skipping over the slots with userspace_addr == 0.
+	 */
+	spin_lock(&kvm->mmu_lock);
+	for (i = 0; i < kvm->nmemslots; i++) {
+		struct kvm_memory_slot *memslot = &kvm->memslots[i];
+		unsigned long start = memslot->userspace_addr;
+		unsigned long end;
+
+		/* mmu_lock protects userspace_addr */
+		if (!start)
+			continue;
+
+		end = start + (memslot->npages << PAGE_SHIFT);
+		if (hva >= start && hva < end) {
+			gfn_t gfn_offset = (hva - start) >> PAGE_SHIFT;
+			young |= kvm_age_rmapp(kvm, &memslot->rmap[gfn_offset]);
+		}
+	}
+	spin_unlock(&kvm->mmu_lock);
+
+	if (young)
+		kvm_flush_remote_tlbs(kvm);
+
+	return young;
+}
+
 #ifdef MMU_DEBUG
 static int is_empty_shadow_page(u64 *spt)
 {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8f94a0b..a99c2ea 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3167,6 +3167,45 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu)
 	free_page((unsigned long)vcpu->arch.pio_data);
 }
 
+static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
+{
+	struct kvm_arch *kvm_arch;
+	kvm_arch = container_of(mn, struct kvm_arch, mmu_notifier);
+	return container_of(kvm_arch, struct kvm, arch);
+}
+
+void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
+				      struct mm_struct *mm,
+				      unsigned long address)
+{
+	struct kvm *kvm = mmu_notifier_to_kvm(mn);
+	BUG_ON(mm != kvm->mm);
+	kvm_unmap_hva(kvm, address);
+}
+
+void kvm_mmu_notifier_invalidate_pages(struct mmu_notifier *mn,
+				       struct mm_struct *mm,
+				       unsigned long start, unsigned long end)
+{
+	for (; start < end; start += PAGE_SIZE)
+		kvm_mmu_notifier_invalidate_page(mn, mm, start);
+}
+
+int kvm_mmu_notifier_age_page(struct mmu_notifier *mn,
+			      struct mm_struct *mm,
+			      unsigned long address)
+{
+	struct kvm *kvm = mmu_notifier_to_kvm(mn);
+	BUG_ON(mm != kvm->mm);
+	return kvm_age_hva(kvm, address);
+}
+
+static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
+	.invalidate_page = kvm_mmu_notifier_invalidate_page,
+	.invalidate_pages = kvm_mmu_notifier_invalidate_pages,
+	.age_page = kvm_mmu_notifier_age_page,
+};
+
 struct kvm *kvm_arch_create_vm(void)
 {
 	struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL);
@@ -3176,6 +3215,9 @@ struct kvm *kvm_arch_create_vm(void)
 
 	INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
 
+	kvm->arch.mmu_notifier.ops = &kvm_mmu_notifier_ops;
+	mmu_notifier_register(&kvm->arch.mmu_notifier, current->mm);
+
 	return kvm;
 }
diff --git a/include/asm-x86/kvm_host.h b/include/asm-x86/kvm_host.h
index d6db0de..72a7ff4 100644
--- a/include/asm-x86/kvm_host.h
+++ b/include/asm-x86/kvm_host.h
@@ -13,6 +13,7 @@
 
 #include <linux/types.h>
 #include <linux/mm.h>
+#include <linux/mmu_notifier.h>
 
 #include <linux/kvm.h>
 #include <linux/kvm_para.h>
@@ -287,6 +288,8 @@ struct kvm_arch{
 	int round_robin_prev_vcpu;
 	unsigned int tss_addr;
 	struct page *apic_access_page;
+
+	struct mmu_notifier mmu_notifier;
 };
 
 struct kvm_vm_stat {
@@ -404,6 +407,8 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu);
 int kvm_mmu_setup(struct kvm_vcpu *vcpu);
 void kvm_mmu_set_nonpresent_ptes(u64 trap_pte, u64 notrap_pte);
 
+void kvm_unmap_hva(struct kvm *kvm, unsigned long hva);
+int kvm_age_hva(struct kvm *kvm, unsigned long hva);
 int kvm_mmu_reset_context(struct kvm_vcpu *vcpu);
 void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot);
 void kvm_mmu_zap_all(struct kvm *kvm);
* Re: KVM swapping with mmu notifiers #v5
@ 2008-01-31 20:21 Christoph Lameter
From: Christoph Lameter @ 2008-01-31 20:21 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f
On Thu, 31 Jan 2008, Andrea Arcangeli wrote:
> I doubt Christoph's V4 was close to final yet: GRU wasn't covered at
> all, and not even mremap was covered (nor XPMEM nor GRU) in V4.
The GRU not covered? Why would you think that way? mremap is covered
because of the callbacks in unmap_region().
> Making the merge of KVM/GRU dependent on XPMEM support being merged
> doesn't sound like a good idea. My patch also adds no overhead with
> MMU_NOTIFIER=n. I hope Christoph agrees with my proposal to use #v5
> as the mmu core and to merge it into mainline with higher priority, to
> mostly close the discussions on KVM and GRU (optimizations remain
> possible), and to keep working incrementally on XPMEM, pushing it into
> mainline once you have verified that it doesn't crash at runtime and
> that you don't need yet another change of API.
Please read the comments on your #5. #5 makes wrong assumptions about the
nature of pte locks. As a result locking is broken.
* Re: KVM swapping with mmu notifiers #v5
@ 2008-01-31 23:32 Andrea Arcangeli
From: Andrea Arcangeli @ 2008-01-31 23:32 UTC (permalink / raw)
To: Christoph Lameter; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f
On Thu, Jan 31, 2008 at 12:21:34PM -0800, Christoph Lameter wrote:
> On Thu, 31 Jan 2008, Andrea Arcangeli wrote:
>
> > I doubt Christoph's V4 was close to final yet: GRU wasn't covered at
> > all, and not even mremap was covered (nor XPMEM nor GRU) in V4.
>
> The GRU not covered? Why would you think that way? mremap is covered
> because of the callbacks in unmap_region().
I wouldn't be so sure. ptep_clear_flush is called for a reason, and
you have zero range_start calls _before_ the ptep_clear_flush. If
you're right, it means the ptep_clear_flush there is called for no
good reason and should be replaced with ptep_get_and_clear,
eliminating an unnecessary tlb flush from the mremap fast path. That
tlb flush will cost a huge amount with threads: an IPI for every
single PTE on SMP! So you may be right, but then it means we found a
really stupid spot to optimize in mremap. (I have to say I've already
found a silly thing in the ptep_ variant that clears the accessed
bitflag: pte entries without the accessed bit set can't be tlb-cached,
that's a hardware thing, so the tlb flush there on x86 is a total
waste of IPIs.)
> > Being dependent on XPMEM support being merged, to merge KVM/GRU
> > doesn't sound a good idea. My patch provides no overhead with
> > MMU_NOTIFIER=n too. Hope Christoph agrees with my proposal to use #v5
> > as the mmu core and to merge it in mainline with higher priority, to
> > mostly close the discussions on KVM and GRU (optimizations remains
> > possible) and to keep working incrementally on XPMEM and to push it in
> > mainline whenever you verified that it doesn't crash at runtime and
> > that you don't need yet another change of API.
>
> Please read the comments on your #5. #5 makes wrong assumptions about the
> nature of pte locks. As a result locking is broken.
You misunderstood the locking; #v5 is obviously safe. If #v5 weren't
safe, any SMP system with more than 4 cpus would already be crashing,
regardless of my changes...
* Re: KVM swapping with mmu notifiers #v5
@ 2008-02-01  1:38 Christoph Lameter
From: Christoph Lameter @ 2008-02-01 1:38 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f
On Fri, 1 Feb 2008, Andrea Arcangeli wrote:
> > The GRU not covered? Why would you think that way? mremap is covered
> > because of the callbacks in unmap_region().
>
> I wouldn't be so sure. ptep_clear_flush is called for a reason, and
> you have zero range_start calls _before_ the ptep_clear_flush. If
> you're right, it means the ptep_clear_flush there is called for no
> good reason and should be replaced with ptep_get_and_clear,
Ok. I see the point. Will post in response to the other thread.