Re: [PATCH 1/1] mm: thp: kvm: fix memory corruption in KVM with THP enabled

All of lore.kernel.org
 help / color / mirror / Atom feed

From: "Kirill A. Shutemov" <kirill@shutemov.name>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	qemu-devel@nongnu.org,
	"Dr. David Alan Gilbert" <dgilbert@redhat.com>,
	"Li, Liang Z" <liang.z.li@intel.com>,
	Amit Shah <amit.shah@redhat.com>,
	Paolo Bonzini <pbonzini@redhat.com>
Subject: Re: [PATCH 1/1] mm: thp: kvm: fix memory corruption in KVM with THP enabled
Date: Wed, 27 Apr 2016 18:18:34 +0300	[thread overview]
Message-ID: <20160427151834.GC22035@node.shutemov.name> (raw)
In-Reply-To: <20160427145957.GA9217@redhat.com>

On Wed, Apr 27, 2016 at 04:59:57PM +0200, Andrea Arcangeli wrote:
> On Wed, Apr 27, 2016 at 04:50:30PM +0300, Kirill A. Shutemov wrote:
> > I know nothing about kvm. How do you protect against pmd splitting between
> > get_user_pages() and the check?
> 
> get_user_pages_fast() runs fully lockless and unpins the page right
> away (we need a get_user_pages_fast without the FOLL_GET in fact to
> avoid a totally useless atomic_inc/dec!).
> 
> Then we take a lock that is also taken by
> mmu_notifier_invalidate_range_start. This way __split_huge_pmd will
> block in mmu_notifier_invalidate_range_start if it tries to run again
> (every other mmu notifier like mmu_notifier_invalidate_page will also
> block).
> 
> Then after we serialized against __split_huge_pmd through the MMU
> notifier KVM internal locking, we are able to tell if any mmu_notifier
> invalidate happened in the region just before get_user_pages_fast()
> was invoked, until we call PageCompoundTransMap and we actually map
> the shadow pagetable into the compound page with hugepage
> granularity (to allow real 2MB TLBs if guest also uses trans_huge_pmd
> in the guest pagetables).
> 
> After the shadow pagetable is mapped, we drop the internal MMU
> notifier lock and __split_huge_pmd mmu_notifier_invalidate_range_start
> can continue and drop the shadow pagetable that we just mapped in the
> above paragraph just before dropping the mmu notifier internal lock.
> 
> To be able to tell if any invalidate happened while
> get_user_pages_fast was running and until we grab the lock again and
> we start mapping the shadow pagtable we use:
> 
> 	mmu_seq = vcpu->kvm->mmu_notifier_seq;
> 	smp_rmb();
> 
> 	if (try_async_pf(vcpu, prefault, gfn, v, &pfn, write, &map_writable))
> 	    ^^^^^^^^^^^^ this is get_user_pages and does put_page on the page
> 	    		 and just returns the &pfn
> 	    		 this is why we need a get_user_pages_fast that won't
> 			 attempt to touch the page->_count at all! we can avoid
> 			 2 atomic ops for each secondary MMU fault that way
> 		return 0;
> 
> 	spin_lock(&vcpu->kvm->mmu_lock);
> 	if (mmu_notifier_retry(vcpu->kvm, mmu_seq))
> 		goto out_unlock;
> 	... here we check PageTransCompoundMap(pfn_to_page(pfn)) and
> 	map a 4k or 2MB shadow pagetable on "pfn" ...
> 
> 
> Note mmu_notifier_retry does the other side of the smp_rmb():
> 
> 	smp_rmb();
> 	if (kvm->mmu_notifier_seq != mmu_seq)
> 		return 1;
> 	return 0;

Okay, I see.

But do we really want to make PageTransCompoundMap() visiable beyond KVM
code? It looks like too KVM-specific.

-- 
 Kirill A. Shutemov

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

WARNING: multiple messages have this Message-ID (diff)

From: "Kirill A. Shutemov" <kirill@shutemov.name>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	qemu-devel@nongnu.org,
	"Dr. David Alan Gilbert" <dgilbert@redhat.com>,
	"Li, Liang Z" <liang.z.li@intel.com>,
	Amit Shah <amit.shah@redhat.com>,
	Paolo Bonzini <pbonzini@redhat.com>
Subject: Re: [PATCH 1/1] mm: thp: kvm: fix memory corruption in KVM with THP enabled
Date: Wed, 27 Apr 2016 18:18:34 +0300	[thread overview]
Message-ID: <20160427151834.GC22035@node.shutemov.name> (raw)
In-Reply-To: <20160427145957.GA9217@redhat.com>

On Wed, Apr 27, 2016 at 04:59:57PM +0200, Andrea Arcangeli wrote:
> On Wed, Apr 27, 2016 at 04:50:30PM +0300, Kirill A. Shutemov wrote:
> > I know nothing about kvm. How do you protect against pmd splitting between
> > get_user_pages() and the check?
> 
> get_user_pages_fast() runs fully lockless and unpins the page right
> away (we need a get_user_pages_fast without the FOLL_GET in fact to
> avoid a totally useless atomic_inc/dec!).
> 
> Then we take a lock that is also taken by
> mmu_notifier_invalidate_range_start. This way __split_huge_pmd will
> block in mmu_notifier_invalidate_range_start if it tries to run again
> (every other mmu notifier like mmu_notifier_invalidate_page will also
> block).
> 
> Then after we serialized against __split_huge_pmd through the MMU
> notifier KVM internal locking, we are able to tell if any mmu_notifier
> invalidate happened in the region just before get_user_pages_fast()
> was invoked, until we call PageCompoundTransMap and we actually map
> the shadow pagetable into the compound page with hugepage
> granularity (to allow real 2MB TLBs if guest also uses trans_huge_pmd
> in the guest pagetables).
> 
> After the shadow pagetable is mapped, we drop the internal MMU
> notifier lock and __split_huge_pmd mmu_notifier_invalidate_range_start
> can continue and drop the shadow pagetable that we just mapped in the
> above paragraph just before dropping the mmu notifier internal lock.
> 
> To be able to tell if any invalidate happened while
> get_user_pages_fast was running and until we grab the lock again and
> we start mapping the shadow pagtable we use:
> 
> 	mmu_seq = vcpu->kvm->mmu_notifier_seq;
> 	smp_rmb();
> 
> 	if (try_async_pf(vcpu, prefault, gfn, v, &pfn, write, &map_writable))
> 	    ^^^^^^^^^^^^ this is get_user_pages and does put_page on the page
> 	    		 and just returns the &pfn
> 	    		 this is why we need a get_user_pages_fast that won't
> 			 attempt to touch the page->_count at all! we can avoid
> 			 2 atomic ops for each secondary MMU fault that way
> 		return 0;
> 
> 	spin_lock(&vcpu->kvm->mmu_lock);
> 	if (mmu_notifier_retry(vcpu->kvm, mmu_seq))
> 		goto out_unlock;
> 	... here we check PageTransCompoundMap(pfn_to_page(pfn)) and
> 	map a 4k or 2MB shadow pagetable on "pfn" ...
> 
> 
> Note mmu_notifier_retry does the other side of the smp_rmb():
> 
> 	smp_rmb();
> 	if (kvm->mmu_notifier_seq != mmu_seq)
> 		return 1;
> 	return 0;

Okay, I see.

But do we really want to make PageTransCompoundMap() visiable beyond KVM
code? It looks like too KVM-specific.

-- 
 Kirill A. Shutemov

WARNING: multiple messages have this Message-ID (diff)

From: "Kirill A. Shutemov" <kirill@shutemov.name>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	qemu-devel@nongnu.org,
	"Dr. David Alan Gilbert" <dgilbert@redhat.com>,
	"Li, Liang Z" <liang.z.li@intel.com>,
	Amit Shah <amit.shah@redhat.com>,
	Paolo Bonzini <pbonzini@redhat.com>
Subject: Re: [Qemu-devel] [PATCH 1/1] mm: thp: kvm: fix memory corruption in KVM with THP enabled
Date: Wed, 27 Apr 2016 18:18:34 +0300	[thread overview]
Message-ID: <20160427151834.GC22035@node.shutemov.name> (raw)
In-Reply-To: <20160427145957.GA9217@redhat.com>

On Wed, Apr 27, 2016 at 04:59:57PM +0200, Andrea Arcangeli wrote:
> On Wed, Apr 27, 2016 at 04:50:30PM +0300, Kirill A. Shutemov wrote:
> > I know nothing about kvm. How do you protect against pmd splitting between
> > get_user_pages() and the check?
> 
> get_user_pages_fast() runs fully lockless and unpins the page right
> away (we need a get_user_pages_fast without the FOLL_GET in fact to
> avoid a totally useless atomic_inc/dec!).
> 
> Then we take a lock that is also taken by
> mmu_notifier_invalidate_range_start. This way __split_huge_pmd will
> block in mmu_notifier_invalidate_range_start if it tries to run again
> (every other mmu notifier like mmu_notifier_invalidate_page will also
> block).
> 
> Then after we serialized against __split_huge_pmd through the MMU
> notifier KVM internal locking, we are able to tell if any mmu_notifier
> invalidate happened in the region just before get_user_pages_fast()
> was invoked, until we call PageCompoundTransMap and we actually map
> the shadow pagetable into the compound page with hugepage
> granularity (to allow real 2MB TLBs if guest also uses trans_huge_pmd
> in the guest pagetables).
> 
> After the shadow pagetable is mapped, we drop the internal MMU
> notifier lock and __split_huge_pmd mmu_notifier_invalidate_range_start
> can continue and drop the shadow pagetable that we just mapped in the
> above paragraph just before dropping the mmu notifier internal lock.
> 
> To be able to tell if any invalidate happened while
> get_user_pages_fast was running and until we grab the lock again and
> we start mapping the shadow pagtable we use:
> 
> 	mmu_seq = vcpu->kvm->mmu_notifier_seq;
> 	smp_rmb();
> 
> 	if (try_async_pf(vcpu, prefault, gfn, v, &pfn, write, &map_writable))
> 	    ^^^^^^^^^^^^ this is get_user_pages and does put_page on the page
> 	    		 and just returns the &pfn
> 	    		 this is why we need a get_user_pages_fast that won't
> 			 attempt to touch the page->_count at all! we can avoid
> 			 2 atomic ops for each secondary MMU fault that way
> 		return 0;
> 
> 	spin_lock(&vcpu->kvm->mmu_lock);
> 	if (mmu_notifier_retry(vcpu->kvm, mmu_seq))
> 		goto out_unlock;
> 	... here we check PageTransCompoundMap(pfn_to_page(pfn)) and
> 	map a 4k or 2MB shadow pagetable on "pfn" ...
> 
> 
> Note mmu_notifier_retry does the other side of the smp_rmb():
> 
> 	smp_rmb();
> 	if (kvm->mmu_notifier_seq != mmu_seq)
> 		return 1;
> 	return 0;

Okay, I see.

But do we really want to make PageTransCompoundMap() visiable beyond KVM
code? It looks like too KVM-specific.

-- 
 Kirill A. Shutemov

next prev parent reply	other threads:[~2016-04-27 15:18 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-04-27 12:04 [PATCH 1/1] mm: thp: kvm: fix memory corruption in KVM with THP enabled Andrea Arcangeli
2016-04-27 12:04 ` [Qemu-devel] " Andrea Arcangeli
2016-04-27 12:04 ` Andrea Arcangeli
2016-04-27 13:50 ` Kirill A. Shutemov
2016-04-27 13:50   ` [Qemu-devel] " Kirill A. Shutemov
2016-04-27 13:50   ` Kirill A. Shutemov
2016-04-27 14:59   ` Andrea Arcangeli
2016-04-27 14:59     ` [Qemu-devel] " Andrea Arcangeli
2016-04-27 14:59     ` Andrea Arcangeli
2016-04-27 15:18     ` Kirill A. Shutemov [this message]
2016-04-27 15:18       ` [Qemu-devel] " Kirill A. Shutemov
2016-04-27 15:18       ` Kirill A. Shutemov
2016-04-27 15:57       ` Andrea Arcangeli
2016-04-27 15:57         ` [Qemu-devel] " Andrea Arcangeli
2016-04-27 15:57         ` Andrea Arcangeli
2016-04-27 16:03         ` Andrea Arcangeli
2016-04-27 16:03           ` [Qemu-devel] " Andrea Arcangeli
2016-04-27 16:03           ` Andrea Arcangeli

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20160427151834.GC22035@node.shutemov.name \
    --to=kirill@shutemov.name \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=amit.shah@redhat.com \
    --cc=dgilbert@redhat.com \
    --cc=liang.z.li@intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=pbonzini@redhat.com \
    --cc=qemu-devel@nongnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.