Re: [PATCH 1/2] kvm: Fix mmu_notifier release race

All of lore.kernel.org
 help / color / mirror / Atom feed

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Suzuki K Poulose <Suzuki.Poulose@arm.com>
Cc: "Radim Krčmář" <rkrcmar@redhat.com>,
	pbonzini@redhat.com, christoffer.dall@linaro.org,
	linux-kernel@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	kvmarm@lists.cs.columbia.edu, kvm@vger.kernel.org,
	marc.zyngier@arm.com, mark.rutland@arm.com,
	andreyknvl@google.com, "Will Deacon" <Will.Deacon@arm.com>
Subject: Re: [PATCH 1/2] kvm: Fix mmu_notifier release race
Date: Wed, 26 Apr 2017 09:17:51 -0700	[thread overview]
Message-ID: <20170426161751.GV3956@linux.vnet.ibm.com> (raw)
In-Reply-To: <c1232b1d-ad82-794b-1b86-4d0cc0d4cd7f@arm.com>

On Wed, Apr 26, 2017 at 05:03:44PM +0100, Suzuki K Poulose wrote:
> On 25/04/17 19:49, Radim Krčmář wrote:
> >2017-04-24 11:10+0100, Suzuki K Poulose:
> >>The KVM uses mmu_notifier (wherever available) to keep track
> >>of the changes to the mm of the guest. The guest shadow page
> >>tables are released when the VM exits via mmu_notifier->ops.release().
> >>There is a rare chance that the mmu_notifier->release could be
> >>called more than once via two different paths, which could end
> >>up in use-after-free of kvm instance (such as [0]).
> >>
> >>e.g:
> >>
> >>thread A                                        thread B
> >>-------                                         --------------
> >>
> >> get_signal->                                   kvm_destroy_vm()->
> >> do_exit->                                        mmu_notifier_unregister->
> >> exit_mm->                                        kvm_arch_flush_shadow_all()->
> >> exit_mmap->                                      spin_lock(&kvm->mmu_lock)
> >> mmu_notifier_release->                           ....
> >>  kvm_arch_flush_shadow_all()->                   .....
> >>  ... spin_lock(&kvm->mmu_lock)                   .....
> >>                                                  spin_unlock(&kvm->mmu_lock)
> >>                                                kvm_arch_free_kvm()
> >>   *** use after free of kvm ***
> >
> >I don't understand this race ...
> >a piece of code in mmu_notifier_unregister() says:
> >
> >  	/*
> >  	 * Wait for any running method to finish, of course including
> >  	 * ->release if it was run by mmu_notifier_release instead of us.
> >  	 */
> >  	synchronize_srcu(&srcu);
> >
> >and code before that removes the notifier from the list, so it cannot be
> >called after we pass this point.  mmu_notifier_release() does roughly
> >the same and explains it as:
> >
> >  	/*
> >  	 * synchronize_srcu here prevents mmu_notifier_release from returning to
> >  	 * exit_mmap (which would proceed with freeing all pages in the mm)
> >  	 * until the ->release method returns, if it was invoked by
> >  	 * mmu_notifier_unregister.
> >  	 *
> >  	 * The mmu_notifier_mm can't go away from under us because one mm_count
> >  	 * is held by exit_mmap.
> >  	 */
> >  	synchronize_srcu(&srcu);
> >
> >The call of mmu_notifier->release is protected by srcu in both cases and
> >while it seems possible that mmu_notifier->release would be called
> >twice, I don't see a combination that could result in use-after-free
> >from mmu_notifier_release after mmu_notifier_unregister() has returned.
> 
> Thanks for bringing it up. Even I am wondering why this is triggered ! (But it
> does get triggered for sure !!)
> 
> The only difference I can spot with _unregister & _release paths are the way
> we use src_read_lock across the deletion of the entry from the list.
> 
> In mmu_notifier_unregister() we do :
> 
>                 id = srcu_read_lock(&srcu);
>                 /*
>                  * exit_mmap will block in mmu_notifier_release to guarantee
>                  * that ->release is called before freeing the pages.
>                  */
>                 if (mn->ops->release)
>                         mn->ops->release(mn, mm);
>                 srcu_read_unlock(&srcu, id);
> 
> ## Releases the srcu lock here and then goes on to grab the spin_lock.
> 
>                 spin_lock(&mm->mmu_notifier_mm->lock);
>                 /*
>                  * Can not use list_del_rcu() since __mmu_notifier_release
>                  * can delete it before we hold the lock.
>                  */
>                 hlist_del_init_rcu(&mn->hlist);
>                 spin_unlock(&mm->mmu_notifier_mm->lock);
> 
> While in mmu_notifier_release() we hold it until the node(s) are deleted from the
> list :
>         /*
>          * SRCU here will block mmu_notifier_unregister until
>          * ->release returns.
>          */
>         id = srcu_read_lock(&srcu);
>         hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist)
>                 /*
>                  * If ->release runs before mmu_notifier_unregister it must be
>                  * handled, as it's the only way for the driver to flush all
>                  * existing sptes and stop the driver from establishing any more
>                  * sptes before all the pages in the mm are freed.
>                  */
>                 if (mn->ops->release)
>                         mn->ops->release(mn, mm);
> 
>         spin_lock(&mm->mmu_notifier_mm->lock);
>         while (unlikely(!hlist_empty(&mm->mmu_notifier_mm->list))) {
>                 mn = hlist_entry(mm->mmu_notifier_mm->list.first,
>                                  struct mmu_notifier,
>                                  hlist);
>                 /*
>                  * We arrived before mmu_notifier_unregister so
>                  * mmu_notifier_unregister will do nothing other than to wait
>                  * for ->release to finish and for mmu_notifier_unregister to
>                  * return.
>                  */
>                 hlist_del_init_rcu(&mn->hlist);
>         }
>         spin_unlock(&mm->mmu_notifier_mm->lock);
>         srcu_read_unlock(&srcu, id);
> 
> ## The lock is release only after the deletion of the node.
> 
> Both are followed by a synchronize_srcu(). Now, I am wondering if the unregister path
> could potentially miss SRCU read lock held in _release() path and go onto finish the
> synchronize_srcu before the item is deleted ? May be we should do the read_unlock
> after the deletion of the node in _unregister (like we do in the _release()) ?
> 
> >
> >Doesn't [2/2] solve the exact same issue (that the release method cannot
> >be called twice in parallel)?
> 
> Not really. This could be a race between a release() and one of the other notifier
> callbacks. e.g, In [0], we were hitting a use-after-free in kvm_unmap_hva() where,
> the unregister could have succeeded and released the KVM.
> 
> 
> [0] http://lkml.kernel.org/r/febea966-3767-21ff-3c40-1a76d1399138@suse.de
> 
> In effect this all could be due to the same reason, the synchronize in unregister
> missing another reader.

If this is at all reproducible, I suggest use of ftrace or event tracing
to work out exactly what is happening.

							Thanx, Paul

WARNING: multiple messages have this Message-ID (diff)

From: paulmck@linux.vnet.ibm.com (Paul E. McKenney)
To: linux-arm-kernel@lists.infradead.org
Subject: [PATCH 1/2] kvm: Fix mmu_notifier release race
Date: Wed, 26 Apr 2017 09:17:51 -0700	[thread overview]
Message-ID: <20170426161751.GV3956@linux.vnet.ibm.com> (raw)
In-Reply-To: <c1232b1d-ad82-794b-1b86-4d0cc0d4cd7f@arm.com>

On Wed, Apr 26, 2017 at 05:03:44PM +0100, Suzuki K Poulose wrote:
> On 25/04/17 19:49, Radim Kr?m?? wrote:
> >2017-04-24 11:10+0100, Suzuki K Poulose:
> >>The KVM uses mmu_notifier (wherever available) to keep track
> >>of the changes to the mm of the guest. The guest shadow page
> >>tables are released when the VM exits via mmu_notifier->ops.release().
> >>There is a rare chance that the mmu_notifier->release could be
> >>called more than once via two different paths, which could end
> >>up in use-after-free of kvm instance (such as [0]).
> >>
> >>e.g:
> >>
> >>thread A                                        thread B
> >>-------                                         --------------
> >>
> >> get_signal->                                   kvm_destroy_vm()->
> >> do_exit->                                        mmu_notifier_unregister->
> >> exit_mm->                                        kvm_arch_flush_shadow_all()->
> >> exit_mmap->                                      spin_lock(&kvm->mmu_lock)
> >> mmu_notifier_release->                           ....
> >>  kvm_arch_flush_shadow_all()->                   .....
> >>  ... spin_lock(&kvm->mmu_lock)                   .....
> >>                                                  spin_unlock(&kvm->mmu_lock)
> >>                                                kvm_arch_free_kvm()
> >>   *** use after free of kvm ***
> >
> >I don't understand this race ...
> >a piece of code in mmu_notifier_unregister() says:
> >
> >  	/*
> >  	 * Wait for any running method to finish, of course including
> >  	 * ->release if it was run by mmu_notifier_release instead of us.
> >  	 */
> >  	synchronize_srcu(&srcu);
> >
> >and code before that removes the notifier from the list, so it cannot be
> >called after we pass this point.  mmu_notifier_release() does roughly
> >the same and explains it as:
> >
> >  	/*
> >  	 * synchronize_srcu here prevents mmu_notifier_release from returning to
> >  	 * exit_mmap (which would proceed with freeing all pages in the mm)
> >  	 * until the ->release method returns, if it was invoked by
> >  	 * mmu_notifier_unregister.
> >  	 *
> >  	 * The mmu_notifier_mm can't go away from under us because one mm_count
> >  	 * is held by exit_mmap.
> >  	 */
> >  	synchronize_srcu(&srcu);
> >
> >The call of mmu_notifier->release is protected by srcu in both cases and
> >while it seems possible that mmu_notifier->release would be called
> >twice, I don't see a combination that could result in use-after-free
> >from mmu_notifier_release after mmu_notifier_unregister() has returned.
> 
> Thanks for bringing it up. Even I am wondering why this is triggered ! (But it
> does get triggered for sure !!)
> 
> The only difference I can spot with _unregister & _release paths are the way
> we use src_read_lock across the deletion of the entry from the list.
> 
> In mmu_notifier_unregister() we do :
> 
>                 id = srcu_read_lock(&srcu);
>                 /*
>                  * exit_mmap will block in mmu_notifier_release to guarantee
>                  * that ->release is called before freeing the pages.
>                  */
>                 if (mn->ops->release)
>                         mn->ops->release(mn, mm);
>                 srcu_read_unlock(&srcu, id);
> 
> ## Releases the srcu lock here and then goes on to grab the spin_lock.
> 
>                 spin_lock(&mm->mmu_notifier_mm->lock);
>                 /*
>                  * Can not use list_del_rcu() since __mmu_notifier_release
>                  * can delete it before we hold the lock.
>                  */
>                 hlist_del_init_rcu(&mn->hlist);
>                 spin_unlock(&mm->mmu_notifier_mm->lock);
> 
> While in mmu_notifier_release() we hold it until the node(s) are deleted from the
> list :
>         /*
>          * SRCU here will block mmu_notifier_unregister until
>          * ->release returns.
>          */
>         id = srcu_read_lock(&srcu);
>         hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist)
>                 /*
>                  * If ->release runs before mmu_notifier_unregister it must be
>                  * handled, as it's the only way for the driver to flush all
>                  * existing sptes and stop the driver from establishing any more
>                  * sptes before all the pages in the mm are freed.
>                  */
>                 if (mn->ops->release)
>                         mn->ops->release(mn, mm);
> 
>         spin_lock(&mm->mmu_notifier_mm->lock);
>         while (unlikely(!hlist_empty(&mm->mmu_notifier_mm->list))) {
>                 mn = hlist_entry(mm->mmu_notifier_mm->list.first,
>                                  struct mmu_notifier,
>                                  hlist);
>                 /*
>                  * We arrived before mmu_notifier_unregister so
>                  * mmu_notifier_unregister will do nothing other than to wait
>                  * for ->release to finish and for mmu_notifier_unregister to
>                  * return.
>                  */
>                 hlist_del_init_rcu(&mn->hlist);
>         }
>         spin_unlock(&mm->mmu_notifier_mm->lock);
>         srcu_read_unlock(&srcu, id);
> 
> ## The lock is release only after the deletion of the node.
> 
> Both are followed by a synchronize_srcu(). Now, I am wondering if the unregister path
> could potentially miss SRCU read lock held in _release() path and go onto finish the
> synchronize_srcu before the item is deleted ? May be we should do the read_unlock
> after the deletion of the node in _unregister (like we do in the _release()) ?
> 
> >
> >Doesn't [2/2] solve the exact same issue (that the release method cannot
> >be called twice in parallel)?
> 
> Not really. This could be a race between a release() and one of the other notifier
> callbacks. e.g, In [0], we were hitting a use-after-free in kvm_unmap_hva() where,
> the unregister could have succeeded and released the KVM.
> 
> 
> [0] http://lkml.kernel.org/r/febea966-3767-21ff-3c40-1a76d1399138 at suse.de
> 
> In effect this all could be due to the same reason, the synchronize in unregister
> missing another reader.

If this is at all reproducible, I suggest use of ftrace or event tracing
to work out exactly what is happening.

							Thanx, Paul

next prev parent reply	other threads:[~2017-04-26 16:17 UTC|newest]

Thread overview: 28+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-04-24 10:10 [PATCH 0/2] kvm: Fixes for race conditions Suzuki K Poulose
2017-04-24 10:10 ` Suzuki K Poulose
2017-04-24 10:10 ` Suzuki K Poulose
2017-04-24 10:10 ` [PATCH 1/2] kvm: Fix mmu_notifier release race Suzuki K Poulose
2017-04-24 10:10   ` Suzuki K Poulose
2017-04-24 10:10   ` Suzuki K Poulose
2017-04-25 15:37   ` Christoffer Dall
2017-04-25 15:37     ` Christoffer Dall
2017-04-25 15:37     ` Christoffer Dall
2017-04-25 18:49   ` Radim Krčmář
2017-04-25 18:49     ` Radim Krčmář
2017-04-25 18:49     ` Radim Krčmář
2017-04-26 16:03     ` Suzuki K Poulose
2017-04-26 16:03       ` Suzuki K Poulose
2017-04-26 16:17       ` Paul E. McKenney [this message]
2017-04-26 16:17         ` Paul E. McKenney
2017-04-28 17:20       ` Suzuki K Poulose
2017-04-28 17:20         ` Suzuki K Poulose
2017-04-28 17:20         ` Suzuki K Poulose
2017-05-03 13:13         ` Suzuki K Poulose
2017-05-03 13:13           ` Suzuki K Poulose
2017-05-03 13:13           ` Suzuki K Poulose
2017-04-24 10:10 ` [PATCH 2/2] kvm: arm/arm64: Fix race in resetting stage2 PGD Suzuki K Poulose
2017-04-24 10:10   ` Suzuki K Poulose
2017-04-24 10:10   ` Suzuki K Poulose
2017-04-24 12:27   ` Christoffer Dall
2017-04-24 12:27     ` Christoffer Dall
2017-04-24 12:27     ` Christoffer Dall

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20170426161751.GV3956@linux.vnet.ibm.com \
    --to=paulmck@linux.vnet.ibm.com \
    --cc=Suzuki.Poulose@arm.com \
    --cc=Will.Deacon@arm.com \
    --cc=andreyknvl@google.com \
    --cc=christoffer.dall@linaro.org \
    --cc=kvm@vger.kernel.org \
    --cc=kvmarm@lists.cs.columbia.edu \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=marc.zyngier@arm.com \
    --cc=mark.rutland@arm.com \
    --cc=pbonzini@redhat.com \
    --cc=rkrcmar@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.